Got BDE?⚡️
Hi BDEs! Welcome to Lesson 16 of the free Big SQL Energy Beginner Course⚡️ Today we’re learning about Count(*) vs. Count(Column) in SQL!
Return to Big SQL Energy Beginner homepage here to access the other lessons:
If you’re new here:
💁🏽♀️ Who Am I?
I’m Jess Ramos, the founder of Big Data Energy and the creator of the BEST SQL course and community: Big SQL Energy⚡️. Check me out on socials: 🔗LinkedIn, 🔗Instagram, 🔗YouTube, and 🔗TikTok. And of course subscribe to my 🔗newsletter here for all my upcoming lessons and updates— all for free!
⚡️ What is Big SQL Energy?
🔗Big SQL Energy Intermediate is an intermediate SQL course designed to solve real-world business problems hands-on using realistic data and a modern tech stack. You’ll walk away with 2 SQL portfolio projects, portfolio building guidance, access to the Big Data Energy Discord community, and lots of confidence for you upcoming coding interviews! You also get access to all monthly events including masterclasses, office hours, and guest speakers. This course is all of the most important things I’ve learned on the job as a Senior Data Analyst in tech who grew from a $72K salary to over a $150K salary. And now I’m sharing it with YOU! It’s more than just another course— it’s a challenging program designed to upskill your SQL to the intermediate level, get ready to ace your live coding interviews, and build connections in the data community through the Discord and events.
Lesson 16: Count(*) vs. Count(Column)
In the last lesson, we learned aggregate functions, and I introduced you to the count() function which can be useful for counting things in SQL. In the last lesson, we learned how to count non-null values in a column by putting the column name inside the function, but there’s another to count in SQL as well…
1. Count(column):
Counts all the non-null values in a column.
select
count(order_id) as order_count
from
orders
Count(column) is a selective count because it only counts the non-null values in a column which means that it skips, or excludes any nulls in a column. It can helpful to answer business questions like: “how many times did X actually happen” or “how many of X really exist?”.
Although it’s great for excluding null values, it DOES count duplicate values, so if a value shows up multiple times in an output or column, it WILL be included multiple times in the count. This means that you can easily overcount or double count the number of items if there are duplicate values in the column. Very often, this is fine! But it can greatly mess up your analysis if duplicates are something that will affect your business solution. So you always have to think about if you should be including them or excluding them from the count based on your scenario! If you do want the number of distinct values to answer your business question, you can make a slight modification to your count using distinct:
select
count(distinct order_id) as order_count
from
orders
Adding distinct in front of the column name will dedupe the column values before the count is performed which will remove duplicates and give you a count of unique values only without double counting. Notice that in the example above that I used distinct in front of order_id, which we know is the primary key of the table. What impact do we think distinct has here?
NONE! …Other than negative impacts to performance LOL! Since order_id is the primary key of the table, we already know it’s non-null, unique values by definition, so the distinct and non-distinct count will produce the same results. However, adding distinct as a “just in case” safety net is hurting performance and increasing run time since we’re asking SQL to take more time and energy to check for duplicates… even though there aren’t any! So adding distinct can be a great way to dedupe when using count(column), but don’t use it if it doesn’t make sense to.
It might make sense to count distinct customers that have placed orders like this since customer_id is a foreign key in the orders table and likely has dupes. Compare the differences below:
select
count(customer_id) as cust_count,
count(distinct customer_id) as cust_count_distinct
from
orders
2. Count(*):
Counts all the rows in a table or result set.
Count(*) does not look at duplicates or null values in any of the columns, so it’s the most careless and least restrictive way to count things in SQL.
It’s great if you want a total count of the number of rows in your dataset or if you want to count the primary key of a table— aka the number of rows or distinct items in your table! Since each primary key value is non-null and unique, it represents the number of X in the table— X being the granularity of the table. If the granularity of the table is orders for example, that means that every row is a different order uniquely identified with a primary key. So using count(*) to count the number of rows is going to be the same as counting the number of orders or count(order_id). See how count(*) is the same as counting the primary key below:
select
count(order_id) as order_count,
count(*) as num_rows --also order count!
from
orders
But if you want to count the number of unique customers in the orders table for example, count(*) would overcount by a lot since customers can place multiple orders and the customer_id column has duplicates. Compare the difference below:
select
count(distinct customer_id) as cust_count_distinct,
count(*) as not_correct_cust_count_distinct --INCORRECT CUST COUNT
from
orders
So really, when counting in SQL, you need to decide whether to use count(*), count(column) or count(distinct column), and it all depends on whether or not you want to just count rows, exclude null values, or exclude null values AND dedupe.
Here’s a quick summary:
count(*): Counts all the rows
count(column): Counts non-null values in a column
count(distinct column): Counts non-null values in a column & removes duplicate values
Mystery SOLVED! If you’ve enjoyed the free BEGINNER Big SQL Energy course, grab my intermediate course. There are no deadlines or timelines for this course— you start and take as long as you want. It will take your beginner skills to the next level and get you ready to ace your interviews. DM me or email me at courses@bdeanalytics.com if you have questions on the intermediate course!
Return to Big SQL Energy Beginner homepage here to access the other lessons: