Top 5 Data Cleaning Tips💰
Easiest ways to instantly get cleaner data
Got BDE?⚡️
Hi data baddies 💅🏼
Subscribe on YouTube 🎥— Just hit 12K subs 🎉
And BIG announcement… I’m planning some big, BIG Black Friday deals including a NEW COURSE! Make sure to add your name to the waitlist so you find out FIRST when the deals all launch. Spots are limited so get on the list!!
Today we’re learning about DIRTY DATA! Intro video here:
Top 5 Data Cleaning Tips in SQL
1. Removing quotes from text
In our dataset, we have names with single and double quotes in the name strings like ‘Jess’ or “John”.
Use nested REPLACE() to scrub both single and double quotes:
SELECT
REPLACE(REPLACE(first_name, ‘’‘’, ‘’), ‘”’, ‘’) AS clean_first_name
REPLACE(REPLACE(last_name, ‘’‘’, ‘’), ‘”’, ‘’) AS clean_last_name
FROM
customers;
Why all the quotes?‘”‘ means “find a double quote.”
‘‘‘ means “find a single quote.” But here you’ll get an error in SQL so you have to escape the single quote with another single quote like this: ‘‘‘‘. The outer quotes open and close the string, the double single-quotes escape the character you want removed. This is a niche example for single quotes!
2. Concatenate first + last names
Once you’ve cleaned the quotes, combine them into one field (line 4):
SELECT
REPLACE(REPLACE(first_name, ‘’‘’, ‘’), ‘”’, ‘’) as first_name_clean,
REPLACE(REPLACE(last_name, ‘’‘’, ‘’), ‘”’, ‘’) as last_name_clean,
concat(first_name_clean, ‘ ‘, last_name_clean) as full_name,
CONCAT(COALESCE(first_name_clean, ‘’), ‘ ‘,
COALESCE(last_name_clean, ‘’)) as full_name_clean,
FROM customers;
Add in lines 5 & 6: add COALESCE() if some values are NULL so you can still get back a useful result. If you’re not familiar with COALESCE, keep reading 😎
Concatenating fields may be useful if it’s the end result for a non-technical stakeholder but be cautious doing this in queries where the field will be used again because it can add to messiness later on! Especially if you’re joining or matching names.
3. Replacing NULL values
There are 2 main ways to replace null values in SQL (lines 2 & 3):
SELECT
ifnull(email, ‘Unknown’),
COALESCE(email, ‘Unknown’) AS clean_email
FROM customers;
IFNULL replaces null values with a default value. COALESCE works similar, but it returns the first non-null value in a list, so you can add multiple values to check for— not just one. And the last value works great as a default value.
4. Removing duplicate emails
First you have to know how to identify and check for duplicates. This is a REALLY common interview question so be sure you know how to do this!!!
SELECT
email,
COUNT(*) AS duplicate_count
FROM
messy_customers
GROUP BY
email
HAVING
COUNT(*) > 1;Use ROW_NUMBER() to remove duplicates based on a value (most recent, ID, any). IN this example, we’ll keep the first chronological ID per email since the email is acting like a unique identifier:
SELECT
customer_id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY customer_id) AS row_num
FROM messy_customers
qualify row_num = 1
You can save the query with row_number as a CTE and then filter the row number for 1 in a where clause in a separate select statement below it. Or you can use qualify like I did above! Either way works, but I’m biased towards the qualify solution because it’s easier to understand.
5. Cleaning and standardizing dates
Some dates in this dataset are datetimes like 2023-05-12 14:23:00, and others are dates with extra quotes added to the string like “2023-05-12“. We want to convert and standardize all of them to regular dates with no extra quotes:
SELECT
REPLACE(REPLACE(signup_date, ‘’‘’, ‘’), ‘”’, ‘’)::date as date_clean,
FROM customers;
That’s a wrap on our data cleaning! Don’t forget to add your name to the Black Friday waitlist to make sure you get first access to all the deals when they drop! Spots are limited, so if you’re not sure, just get on the list just in case!
Bye BDEs 💅🏼
Jess Ramos 💕
⚡️ Social Highlights:
⚡️If you’re new here:
💁🏽♀️ Who Am I?
I’m Jess Ramos, the founder of Big Data Energy and the creator of the BEST SQL course and community: Big SQL Energy⚡️. Check me out on socials: 🔗YouTube, 🔗LinkedIn, 🔗Instagram, and 🔗TikTok. And of course subscribe to my 🔗newsletter here for all my upcoming lessons and updates— all for free!



