Row_Number() vs. Rank() vs. Dense_Rank() Easily Explained!
One of the most commonly-asked interview questions!
Got BDE?⚡️
Hi BDEs! Hope you’ve had an amazing week. Today we’re breaking down a SUPER common interview question: What’s the difference between row_number(), rank(), and dense_rank()?
Amazing $$$ Opportunity!💰
IBM is offering the 2024 Call for Code Global challenge with a $1000 prize! 💰
All you have to do is go to THIS LINK🔗, make an account, and go to “Start with Free Training.” Here, you can begin to upskill and learn FOR FREE. If you complete all 6 courses by October 17 and take the Call for Code AI Knowledge Challenge Quiz, you could win $1000! Whoever gets a perfect score (or the top score) by October 31 will be the lucky winner. I REALLY hope it’s one of my followers, so go check it out 😉
Dataset of the Week⚡️
This week’s dataset is a Taylor Swift Spotify dataset. Am I still coping from my cancelled Vienna concert?! Maybe 🥲
Row Number vs. Rank vs. Dense Rank
Calculate the row number, rank, and dense rank for the danceability scores rounded to one decimal place for the album Lover.
What’s different about the calculations and which would you use if you were making a playlist with the top 3 most danceable songs?
Social Highlights⚡️
Row_Number(), Rank(), & Dense_Rank() Easily Explained!
A lot of business problems rely on ranking methods in SQL:
Most recent order data per customer
First website session per user
Top 3 salaries per department
And the list goes on! More often than not, I’m looking for the most recent or first of something which requires a rank number of 1 (either in descending or ascending ranking order), but sometimes I need a different rank number like for question #3 above. And 9 times out of 10, Row_Number() suffices for a quick rank because I don’t usually care too much about ties, but sometimes you do have to consider tie handling which requires rank() and dense_rank(). Regardless of how little I use these compared to Row_Number(), it’s really good to understand the differences in these calculations because this is one of the most common SQL interview questions.
Seriously, interviewers LOVE asking about the differences between these, so you better know them well and be able to explain them!!
Row_Number()
Let’s start with Row_Number() since it’s the easiest to understand and the most widely-used. Row number provides a unique sequential number to each row and doesn’t account for ties/duplicates. It doesn’t repeat or skip any ranking numbers, and it’s non-deterministic, which means that you might not always get the exact same result since it arbitrarily handles ties. Check out the query and output below which ranks Taylor Swift songs on the 1989 album by danceability (rounded to 1 decimal place) descending, which means that the most (rounded) danceable songs will be at the top.
select
album,
"Song Name",
Danceability,
round(Danceability, 1) as Danceability_rounded,
row_number() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_rownumber
from
taylor
where Album = '1989'
;
Even though there are ties for the rounded danceable score, each song has its own unique row number. Notice how the ties (0.8) are completely ignored.
Also note that I have commented out
partition by Album
for simplicity. This is not needed since I filtered the query for the album 1989 and there’s only 1 album. If I removed the 1989 album filter and wanted to repeat this ranking process for each album, then I would use the partition by inside the window function before the order by to specify how to segment the data for the calculation.
Rank()
Now let’s see how Rank() is different. Rank() ranks each row in the dataset and allows for ties with duplicates. After a tie, it skips to the next row number, so the list of numbers will have gaps and won’t be complete if there are ties. Notice how all of the 0.8 rounded danceability scores are all tied for a rank value of 1, and then the next rank number after that is 5 to account for the 4 above it in ranking.
select
album,
"Song Name",
Danceability,
round(Danceability, 1) as Danceability_rounded,
rank() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_rank
from
taylor
where Album = '1989'
;
Dense_Rank()
Now for dense_rank()… it’s ALMOST like rank() except it doesn’t leave rank number gaps by skipping to the next row number after ties/dups. Dense_rank() uses the next rank number in the sequence after a tie. See the 0.8 values tied for 1 below? Then after that is 2 for the next rank value? This is how dense_rank() and rank() differ.
select
album,
"Song Name",
Danceability,
round(Danceability, 1) as Danceability_rounded,
dense_rank() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_denserank
from
taylor
where Album = '1989'
;
Put them all together!
I personally think it’s easiest to understand them all when they’re in the same query and output side by side.
Apologies for the ugly copy and paste below. Find the commas to see the different rows in the code!!
select
album,
"Song Name",
Danceability,
round(Danceability, 1) as Danceability_rounded,
row_number() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_rownumber,
rank() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_rank,
dense_rank() over(/*partition by Album*/ order by Danceability_rounded desc) as danceability_denserank
from
taylor
where Album = '1989'
;
Knowing how to rank well in SQL is a MUST, but knowing how to explain the differences between row_number(), rank(), and dense_rank() are even more important when it comes to interviews.
But for real, where them girls at?! ⬇️
-xoxo,
jess💕
WHO AM I?⚡️
I'm Jess Ramos, a content creator, instructor, and leader in the data world. I'm super passionate about data and especially SQL! My 9-5 is being a Senior Data Analyst @ Crunchbase, a top tech startup that specializes in private company data. I'm also an Instructor with LinkedIn Learning and the founder of Big Data Energy Analytics. My 5-9 (aka outside of my regular job!) is creating amazing content for YOU on LinkedIn, Instagram, TikTok-- and this newsletter!🤩
Skeptical*
Hi Jess, thank you for explaining this with such a great example. I wanted to try this solution out, but as skeptics w.r.t query size, will it end up utilising a lot of CPU to run this?
Are there ways to optimise this especially if we have a large database?