Top 10 Tips to Optimize SQL Queries
Don't be a chump and write slow, crappy code. Refactor and optimize 🤩
Got BDE?⚡️
Hi BDEs!⚡️ Hope you’ve had an amazing week. I’m so excited because I just hit 200K followers on LinkedIn!! WOOHOO!!! 🥳 Whether you’re a new follower or a long time follower (or you don’t even follow me on LinkedIn), thanks SO MUCH for all the support! ❤️ Y’all are DA BEST!!!
SQL Optimization Tips
SQL is more than just writing a query without errors… Sure, you can write a query that returns an output, and maybe you love the phrase, “if it ain’t broke, don’t fix it!”— but in the world of data, slow and inefficient queries can take up a lot of company resources. Not only do analysts and stakeholders waste time sitting and waiting, but many cloud platforms and data warehouses charge by the processing time and/or data processed. So why would you want to process more data than you need?!
Here are 10 query optimization tips:
1. Avoid SELECT * and instead list desired columns
Selecting ALL columns is kind of lazy to be honest. Do you actually need ALL the columns? Probably not. I’ve seen tables with hundreds of columns, and there’s a huge difference in how long they take to process. Don’t select *, especially if you don’t know how many columns there are. The only exception is selecting * with a limit (like limit 100) to get a good feel for the data available.
select * from table -- avoid: drags along every column whether you need it or not
select column1, column2 from table -- better: just the columns you need
2. Use INNER JOINs over LEFT JOINs when applicable
Of course if you need a left join, DO THE LEFT JOIN. Don’t take my words out of context here. There are legit use cases where you have to do a left join or even a full outer join, but in general, you want to filter out as much data as possible in your join. If you really only need an inner join, don’t keep the extra data by using a left join “just in case”. We’ve all been there and done that. Be as conservative as possible, and even filter out data in the ON clause of the join when you can. From/Join is the first step in the order of execution, so it’s super powerful to cut down on data in this step vs. waiting for WHERE and later steps.
select *
from table1 t1
join table2 t2
  on t1.id = t2.id
  and t2.date >= '2024-06-01' -- filtering in the ON clause cuts rows at the very first step
3. Use WHERE and LIMIT to filter rows
If you’re just exploring the data to see what you have and gauge the possible values, use LIMIT instead of pulling in every. single. row. It’s okay if you even use LIMIT 1000— this is still better than no limit! I do this when I’m writing my query and brainstorming and want to cut down on run time to test a few things. Similarly, with WHERE, filter out any unnecessary rows. Maybe you don’t really need every product or customer. Or maybe you only need the past 3 months or 12 months instead of pulling in 5+ years of data. Filtering data with where can provide more context to the query and drastically cut down on the number of rows you’re processing.
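For example, an exploratory pull might look something like this (table and column names here are made up just for illustration):

```sql
-- quick exploration: recent rows only, capped while you're still poking around
select order_id, customer_id, order_date, total
from orders
where order_date >= '2024-03-01' -- only the window you actually care about
limit 1000;                      -- cap the rows while testing/brainstorming
```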
4. Filter as much as possible as early as possible (consider the order of execution)
This is the best rule of thumb for SQL optimization. Always filter out as much data as possible (of course considering the context and business question) and filter it in as early of a step as possible in the order of execution. Don’t hang on to that unneeded data any longer in the query execution than you need to or you’re just processing data through extra steps for no reason.
-- SIDE QUEST: Order of Execution:
From/Join
Where
Group By
Having
Select
Order By
Limit
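One practical consequence of that order: filters on raw columns belong in WHERE, not HAVING, because WHERE runs before GROUP BY and cuts rows before any aggregation happens. Here’s a quick sketch (hypothetical orders table):

```sql
-- slower: aggregates every region, then throws most of them away in HAVING
select region, sum(total) as revenue
from orders
group by region
having region = 'EMEA';

-- faster: rows are filtered out before the group by even runs
select region, sum(total) as revenue
from orders
where region = 'EMEA'
group by region;
```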
5. Avoid ORDER BY (especially in subqueries and CTEs)
Order By takes a lot of time to run because it’s literally sorting every single row. Don’t use it unless you really need it, or if you’re just ordering a small list of things for your final output. Avoid order by in a CTE or subquery (some dialects, like SQL Server, will even throw an error for order by in a subquery unless it’s paired with something like TOP) because there’s no point in sorting your data in the middle of the process when it’s not even your final visible result. Of course you do need order by in window functions, but that’s different because it’s supporting a calculation and not just for funzies.
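As a sketch (made-up table names): keep the sort out of the CTE and do it once on the final output:

```sql
with recent_orders as (
    select customer_id, total
    from orders
    where order_date >= '2024-06-01'
    -- no order by here: sorting mid-pipeline is wasted work
)
select customer_id, sum(total) as spend
from recent_orders
group by customer_id
order by spend desc; -- sort once, at the very end
```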
6. Avoid using DISTINCT unless necessary (especially when it’s already implied like in UNION)
Do you really need DISTINCT, or are you just “making sure”? Yes, sometimes you do need to use distinct as a safety net, but sometimes you know there’s no possibility for dups, so asking SQL to check and dedup is a waste of processing time. Don’t forget about when distinct is implied, like in UNION. If you know there aren’t dups (or they don’t affect your analysis), just use UNION ALL!
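For example (hypothetical tables), if the two sources can’t overlap, skip the implied dedup:

```sql
-- UNION quietly dedups, which costs extra sort/hash work
select customer_id from us_customers
union
select customer_id from eu_customers;

-- UNION ALL just stacks the rows: much cheaper when dups can't exist
select customer_id from us_customers
union all
select customer_id from eu_customers;
```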
7. Use CTEs when you’ll have to refer to a table/output multiple times
If you’re referring to the same output / calculation multiple times, throw it in a CTE instead of a subquery or recreating it every time. CTEs can really speed up your queries when used well: many engines only have to process the CTE once, and even when they don’t, your logic lives in one place and can be referenced as many times as you need.
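A sketch of what that looks like (made-up tables): define the calculation once up top, then reference it wherever you need it:

```sql
with customer_spend as (
    select customer_id, sum(total) as spend
    from orders
    group by customer_id
)
select c.name, s.spend
from customers c
join customer_spend s on s.customer_id = c.customer_id
where s.spend > (select avg(spend) from customer_spend); -- reused, not recreated
```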
8. Avoid using wildcards at the beginning of a string (‘%jess%’ vs. ‘jess%’)
Wildcards can be inefficient, especially if SQL has to check for patterns in the middle of a string. If SQL looks for the pattern ‘jess%’, it only has to check for a match on the first 4 letters of each string, and then it doesn’t care what’s next. If SQL has to look for a pattern after a wildcard, like ‘%jess%’, it has a lot more work to do to figure out if that pattern is there or not: it’ll likely have to go through much more (if not all) of the string to see if it matches. A leading wildcard also usually prevents the database from using an index on that column, forcing a full scan. So never use wildcards at the beginning of a string if you know the string you’re looking for starts a certain way. In other words: if you’re looking for Jess___, use ‘jess%’ NOT ‘%jess%’ for better efficiency.
select customer_name, customer_id
from customers
where customer_name like 'jess%'     -- can stop checking after the first 4 letters
-- where customer_name like '%jess%' -- has to scan through each string
9. Use EXISTS instead of COUNT and IN
If you’re counting in a subquery or checking to see if something is in the subquery or not, sometimes EXISTS can do the job to see if it exists (i.e. is there or not), which is a lot faster than counting every single instance. You don’t care how many instances of something there are if you’re just looking to see if it’s there at all (at least once). Using EXISTS strategically can be a great way to speed up processing time compared to count(*) and IN!
-- counting every single matching order just to check if any exist:
SELECT customer_id, name
FROM Customers
WHERE (SELECT COUNT(*)
       FROM Orders
       WHERE Orders.customer_id = Customers.customer_id) > 0;

-- vs. EXISTS, which can stop at the first match:
SELECT customer_id, name
FROM Customers
WHERE EXISTS (SELECT 1
              FROM Orders
              WHERE Orders.customer_id = Customers.customer_id);

-- checking membership with IN:
SELECT customer_id, name
FROM Customers
WHERE customer_id IN (SELECT customer_id
                      FROM Orders);

-- vs. EXISTS again:
SELECT customer_id, name
FROM Customers
WHERE EXISTS (SELECT 1
              FROM Orders
              WHERE Orders.customer_id = Customers.customer_id);
10. Avoid complex logic
Complex logic can take forever to run! Deeply nested subqueries, long chains of ORs, and functions wrapped around columns in your WHERE or JOIN clauses all make the optimizer’s job harder. Break the logic into CTEs, simplify where you can, refactor, and optimize.
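One classic example: wrapping a function around a column in WHERE hides that column from any index, so a plain range on the raw column is usually easier for the engine to optimize (exact function syntax like year() varies by dialect):

```sql
-- harder to optimize: the function has to run on every row
select order_id
from orders
where year(order_date) = 2024;

-- simpler and index-friendly: a plain range on the raw column
select order_id
from orders
where order_date >= '2024-01-01'
  and order_date < '2025-01-01';
```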
Obviously you can’t ALWAYS avoid these, and they each have their use cases, but these are good things to think about when optimizing your queries.
I wish I were in London this week for Big Data London, but I’ll have to cope with Fergie instead ⬇️
-xoxo,
jess💕
WHO AM I?⚡️
I'm Jess Ramos, a content creator, instructor, and leader in the data world. I'm super passionate about data and especially SQL! My 9-5 is being a Senior Data Analyst @ Crunchbase, a top tech startup that specializes in private company data. I'm also an Instructor with LinkedIn Learning and the founder of Big Data Energy Analytics. My 5-9 (aka outside of my regular job!) is creating amazing content for YOU on LinkedIn, Instagram, TikTok-- and this newsletter!🤩