Got BDE?⚡️
Hi BDEs!🤓 I’ve had an amazing week so far, and I hope you have too! Today we’re learning all about Subqueries🔗— if you missed the intro on CTEs, you’ll want to catch up HERE🔗 first. Last week, I hung out with tons of amazing content creators and made some collab videos. I had a lot of fun, and this week I’ve been back in the grind, so here’s a technical newsletter🤩
👀Check out how cool Pecan AI🔗 is! The tool helps data analysts build ML models in minutes using SQL using their ChatGPT-like interface. The predictions are extremely accurate and even work with messy data! Watch the demo HERE🔗! Pecan AI is sponsoring this newsletter, so show them your support🤩
Dataset of the Week⚡️
YouTube Sports Channels Statistics Dataset🔗 from Kaggle🔗
Project: YouTube Views using Subqueries
Using a subquery, identify the max number of rows for a single channel_id. Hint: you can do this by counting the number of rows per channel_id in a subquery and then pulling the max number of rows from your subquery result. This will tell you what the maximum number of rows is for a single channel_id out of all the channel_ids (the correct answer should be 2 by the way!). But we won’t know which channel_ids have the max number…
New question: Using a subquery, identify which channel_ids have 2 rows. Hint: you’ll have to filter the original dataset using a subquery that pulls the max number of rows for a channel_id.
Rewrite your subqueries in #1 and #2 as CTEs to practice!
Bonus Question: pull the duplicate channel_ids using HAVING instead of subqueries or CTEs.
Share with a data friend who you want to do this project with you❤️
Social Highlights⚡️
Subqueries: The basics
Learning about subqueries goes along well with CTEs because almost every CTE can be written as a subquery and vice versa. If you missed the article on CTEs 2 weeks ago, catch up HERE🔗.
Subqueries🔗 are a query nested inside of another query. Technically you can have subqueries in subqueries in subqueries in subqueries… and so on… but don’t do this to your coworkers LOL! They can also be turned inside out and written as a CTE (I’ll show you an example of this later!) which greatly increases readability and interpretability. I highly recommend using CTEs for multiple nested subqueries or really complicated/long subqueries for better code organization and readability.
Subqueries vs. CTEs
A few differences between subqueries and CTEs are that:
Subqueries cannot be reused throughout the query unlike CTEs. CTEs can be referred to multiple times within the same query, but subqueries cannot be reused.
You don’t have to name your subquery (although I recommend doing so with an alias) like you do a for CTE.
Subqueries can’t be written recursively like CTEs. If you’re not familiar with this, don’t worry, it’s not used much!
Subqueries are generally harder to interpret since a reader has to start on the inside of the query and work their way out to understand it. I highly recommend using CTEs for more complex queries because too many subqueries can greatly decrease readability and ease of maintenance.
Subqueries in the WHERE clause
A really common example of a subquery is using it in the WHERE clause of the outer query. You use the subquery for conditional logic by using an operator and then filtering the original dataset for that condition. Check out this example below with an aggregate function, min( ):
select *
from orders
where
order_date =
(select min(order_date) from orders) --this is the subquery
-- you can also use other operators here, not just =
Here, the query is being filtered for all orders that have the minimum order date, which is the condition. They can also be used to filter for a column (or a list) in the where clause like this:
select *
from orders
where
order_id IN
(select order_id from payments) --this subquery produces 1 column
Here we’re selecting all the order IDs in the payments table and then filtering the orders table for those orders. This gives us all the orders data for the orders that are in the payments table. Why would we use this instead of manually adding in the IDs we want to search for like this?
select *
from orders
where
order_id IN
('7364', '3820', '9818',...)
Imagine you’re working with a HUGE dataset with thousands or millions of rows… It wouldn’t be convenient to type in or even copy and paste that many IDs into a query. Plus, you could even explode your IDE (not literally) by trying process all those!
Adding in a manual list requires you to figure out which ones to add ahead of time instead of them being added in automatically. This adds in an extra manual step into your process.
Perhaps the most important reason is that we don’t want to hardcode in ID values unless it’s an ad hoc request that’s never going to be used again. It’s better to use a subquery here because the list will dynamically change and refresh every time you run the query. We love fresh data!
Subqueries are great for quick 1-liner queries to use conditional logic in the WHERE clause!
Subqueries in the FROM clause
Subqueries can be used in the FROM clause if you need to calculate something first and then pull from that result. For example, since you can’t nest aggregate functions in SQL, you have to calculate them in 2 separate SELECT statements—using either a CTE or subquery for the first calculation and then the final select statement to nest the second calculation around the first one. Here’s an example with a subquery where we’ll find the maximum average temperature per month:
Keep reading with a 7-day free trial
Subscribe to Big Data Energy⚡️ to keep reading this post and get 7 days of free access to the full post archives.