Got BDE?⚡️
Hi BDEs! Don’t forget the Presale for Big SQL Energy is October 29! It will only be available for presale for 1 week only so stay tuned 👀
Do you love data observability and/or wanna win 2 free tickets to the Formula 1 Las Vegas Grand Prix? 🏎️ Then I have the perfect free learning summit🔗 for you!
Monte Carlo is hosting a free virtual event on how to drive data & AI trust at scale called Impact. And they’ve even brought on an OLYMPIAN, Allyson Felix, to be the keynote and discuss how she leveraged data to improve her performance and secure her first gold.
If you want to learn about how AI will evolve in 2025, data reliability in Gen AI, and implementing data products, you cannot miss this event! Sign up here!🔗
Dataset of the Week⚡️
Here’s some subquery practice!
YouTube Sports Channels Statistics Dataset🔗 from Kaggle🔗
Project: YouTube Views using Subqueries
Using a subquery, identify the max number of rows for a single channel_id. Hint: you can do this by counting the number of rows per channel_id in a subquery and then pulling the max number of rows from your subquery result. This will tell you what the maximum number of rows is for a single channel_id out of all the channel_ids (the correct answer should be 2 by the way!). But we won’t know which channel_ids have the max number…
New question: Using a subquery, identify which channel_ids have 2 rows. Hint: you’ll have to filter the original dataset using a subquery that pulls the max number of rows for a channel_id.
Rewrite your subqueries in #1 and #2 as CTEs to practice!
Bonus Question: pull the duplicate channel_ids using HAVING instead of subqueries or CTEs.
Subqueries: Untangling Correlated vs. Uncorrelated
What’s a Subquery?
In a nutshell: Subqueries🔗 are a query nested inside of another query. They can also usually be turned inside out and written as a CTE which greatly increases readability and interpretability. I highly recommend using CTEs for multiple nested subqueries or really complicated/long subqueries for better code organization and readability. If you missed the article on CTEs a few weeks ago, catch up HERE🔗.
If you missed my article on Intro to Subqueries, I highly recommend catching up HERE🔗 for the basics on subqueries in the from and where clauses! All of these examples here are uncorrelated subqueries, which are easier to understand (in my humble opinion).
Uncorrelated Subqueries:
Uncorrelated subqueries are easier to interpret because they are independent from the outer query since they don’t have any links, references, or dependencies to the outer query. It can run independently from the outer query. In this case, the subquery is run first, a single result set is computed, and then the output is used in the outer query in the second step. They can usually be turned “inside out” and written as CTEs.
Here’s a simple example that pulls the orders that are in the payments table:
--using a subquery
select *
from orders
where
order_id IN
(select order_id from payments)
And another more complex example shown also as a CTE:
--with a Subquery
select
max(avg_temp) as max_avg_temp
from
(select
month,
avg(temperature) avg_temp
from
temperature_log
group by 1
)
--with a CTE
with avgs as(
select
month,
avg(temperature) avg_temp
from
temperature_log
group by 1
)
select
max(avg_temp) as max_avg_temp
from
avgs
Correlated Subqueries:
Correlated subqueries are different because they depend on the outer query for its values— they’re dependent on it! This is the difference from uncorrelated subqueries.
-- employees who have a salary greater than the average
SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);
This query is correlated because the where clause relies on the department_id
of the e2
table in the outer query. The inner query can’t be run independently without the outer query! Here’s another example that ChatGPT wrote for me, but it’s pretty disgusting to look at LOL!
-- Side note: can you identify ChatGPT's error below? 👀
SELECT c.customer_id, c.name
FROM customers c
WHERE (
SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
) > (
SELECT AVG(order_count)
FROM (
SELECT COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) AS avg_orders
);
We can see how the subquery before the >
in the where clause relies on the outer customer table which makes it a correlated subquery.
Well, I’ll c u next tuesday!
-xoxo,
jess💕
WHO AM I?⚡️
I'm Jess Ramos, a content creator, instructor, and leader in the data world. I'm super passionate about data and especially SQL! My 9-5 is being a Senior Data Analyst @ Crunchbase, a top tech startup that specializes in private company data. I'm also an Instructor with LinkedIn Learning and the founder of Big Data Energy Analytics. My 5-9 (aka outside of my regular job!) is creating amazing content for YOU on LinkedIn, Instagram, TikTok-- and this newsletter!🤩