How to build synthetic datasets using AI
DON'T use Kaggle anymore pls!
Got BDE?⚡️
Hi data baddies 💅🏼
If you’ve been spending hours hunting for the “perfect” dataset on Kaggle…
I need you to stop. Right now. Those datasets are unrealistically clean and boring, AND they do absolutely nothing to set you apart in the job market. In 2026, that matters more than ever.
The good news is you don’t need real data to build real, impressive projects. You need to know how to generate the RIGHT synthetic data! And that’s exactly what we’re covering today 💅🏼
So what even is synthetic data?
Synthetic data is an artificially generated dataset that mimics real-world patterns and structures. It looks real, feels real, it’s just not real. Think fake customer lists, mock order records, simulated transaction logs, or sample support tickets. The more realistic, the better.
And no, it’s not cheating. It’s actually the smarter move. 🧠
Why use synthetic datasets?
Privacy: Real data will get you into real trouble. 🙅🏻‍♀️ Laws like GDPR and HIPAA exist for a reason. You cannot just export real patient or customer data. You would literally get sued. And even if it’s not against the law, your employer is NOT going to love finding out you imported their database into an AI tool.
Realism: Real data is messy in ways that don’t serve your project. Kaggle datasets are the opposite; they’re SO clean that they don’t reflect the reality of production at all.
Customization: Synthetic data gives you the best of both worlds. You control the structure, the messiness, and the realism without risking anyone’s data or violating a single compliance policy. You can prompt and create exactly what you want.
The 5 things you MUST specify in your prompt
The BIGGEST mistake is being too vague. Remember, ChatGPT CANNOT read your mind. Typing “make me a practice dataset” just gives you a random, generic mess that doesn’t serve your purpose at all. Your dataset is only going to be as good as your prompt, so here’s what you need to lock in:
1. Volume: Aim for 5,000 to 10,000 rows. Enough volume & variation to build a real analysis without slowing down your machine.
2. Schema: Define your tables, columns, and how they relate. If the customer ID lives in your orders table, it better map to your customers table. Otherwise, you'll have IDs that join to nothing, and you WILL lose your mind.
3. Constraints: Specify data types, allowed values, and logical boundaries. If you want adults only, say ages 18-100. If there is a required date range, define it. Don't assume ChatGPT will figure it out. You NEED to tell it everything.
4. Realism: This is what separates a crappy Kaggle dataset from something that actually feels production-grade. Ask for realistic distributions, logical field relationships (no prescribing chemo to someone with anxiety), randomized IDs, and anti-join cases (rows in one table with no match in the other). For example, you could specify that not every customer should have a matching order, because that's not how real businesses work.
5. Format: This is the one everybody forgets. Specify that you want a downloadable CSV, JSON, or whatever you actually need. Nothing is more frustrating than generating the perfect dataset and getting a wall of text you can't do anything with.
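Once you have the files, it's worth checking that they actually match what you asked for. Here's a minimal Python sketch of that check. It assumes two hypothetical tables, customers and orders, with made-up column names (customer_id, age, order_id), and it uses tiny inline samples in place of the CSVs you'd download. Treat it as a starting point, not the one right way to validate.

```python
import csv
import io

# Hypothetical sample data standing in for the CSVs you downloaded.
CUSTOMERS_CSV = """customer_id,age
1001,34
1002,17
1003,52
1004,41
"""

ORDERS_CSV = """order_id,customer_id
9001,1001
9002,1001
9003,1003
9004,7777
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def check_tables(customers, orders, min_age=18, max_age=100):
    """Compare the generated tables against the rules from the prompt."""
    customer_ids = {row["customer_id"] for row in customers}
    ordering_ids = {row["customer_id"] for row in orders}
    return {
        # Schema: orders that point at a customer who does not exist.
        "orphan_orders": sorted(
            row["order_id"] for row in orders
            if row["customer_id"] not in customer_ids
        ),
        # Realism: customers with no orders at all (the anti-join).
        "customers_without_orders": sorted(customer_ids - ordering_ids),
        # Constraints: ages outside the allowed range.
        "age_violations": sorted(
            row["customer_id"] for row in customers
            if not min_age <= int(row["age"]) <= max_age
        ),
    }


report = check_tables(read_rows(CUSTOMERS_CSV), read_rows(ORDERS_CSV))
for check, ids in report.items():
    print(f"{check}: {ids}")
```

In this sample, order 9004 joins to nothing, customers 1002 and 1004 never ordered, and customer 1002 breaks the adults-only rule. Whether each of those is a bug or a feature depends on what you put in your prompt.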
How to customize for your use case
Once you’ve nailed the five fundamentals, the real fun begins. 🎉 Make your synthetic dataset hyper-relevant to the industry and role you're targeting. Some examples would be:
Healthcare: Build around diagnoses or insurance claims.
E-commerce: Go with shopping behavior.
SaaS: Churn or feature usage.
Finance: Fraud detection.
The more specific to your industry, the better it looks in your portfolio 💅🏼 And don’t forget to match it to the skill you’re actually trying to build. For example:
EDA: Ask for lots of variation in your values, plus interesting patterns & relationships.
ML Model: Ask for an outcome variable to predict.
Data Cleaning: Make it as DIRTY as possible. Missing values, outliers, typos, the works.
Remember: Real data is messy. Yours should be too.
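If a file comes back cleaner than you wanted, you can also dirty it up yourself. This is a rough Python sketch of that idea, not something the prompt does for you. The column names (name, city, amount), the sample rows, and the helpers add_typo and make_dirty are all hypothetical, invented here for illustration.

```python
import random


def add_typo(text, rng):
    """Swap two neighboring characters to imitate a keying mistake."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def make_dirty(rows, rate=0.1, seed=42):
    """Return a copy of rows with missing values, outliers, and typos mixed in."""
    rng = random.Random(seed)  # fixed seed so the mess is reproducible
    dirty = []
    for row in rows:
        row = dict(row)  # copy so the clean data stays untouched
        if rng.random() < rate:
            row["city"] = ""  # missing value
        if rng.random() < rate:
            row["amount"] = row["amount"] * 100  # outlier
        if rng.random() < rate:
            row["name"] = add_typo(row["name"], rng)  # typo
        dirty.append(row)
    return dirty


# Hypothetical clean rows standing in for a generated table.
clean = [
    {"name": "Alice Smith", "city": "Austin", "amount": 25.0},
    {"name": "Bob Jones", "city": "Denver", "amount": 40.0},
    {"name": "Cara Lopez", "city": "Miami", "amount": 32.5},
]

for row in make_dirty(clean, rate=0.5):
    print(row)
```

Each kind of mess is rolled separately, so one row can pick up several problems at once, which is closer to how real data tends to break. Turn rate up or down depending on how much cleaning practice you want.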
Grab my customized prompt template 👇🏻:
You are a data engineer generating a realistic synthetic dataset for [INDUSTRY] and [PROJECT TYPE OR PURPOSE].
Generate [NUMBER] realistic datasets with the following requirements:
Create a [TABLE NAME] table with [ROW COUNT] rows and columns: [LIST REQUIRED COLUMNS], plus any additional realistic columns you think would be useful. [PRIMARY KEY] is the primary key. [FOREIGN KEY 1] and [FOREIGN KEY 2] are foreign keys that connect to the [RELATED TABLE NAME] table. Ensure that [NUMBER] foreign key values exist in the related table but do not appear in this table (to simulate missing relationships).
Create a [DIMENSION TABLE NAME] table with [ROW COUNT] rows and columns: [LIST REQUIRED COLUMNS], plus any additional realistic columns. [PRIMARY KEY] is the primary key and connects to the first table. Ensure that [NUMBER] records in this table have no matching rows in the first table.
For both tables, include high variation across values, non-even category distributions, and realistic data patterns. All ID fields should be random numeric values only (no letters).
[Add in any other requirements, constraints, or behavior rules]
Return each table as a separate, downloadable CSV file.
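If you reuse the template a lot, you can fill in the brackets with a few lines of Python instead of editing by hand. The sketch below is a condensed, one-foreign-key version of the template above. The build_prompt function, its parameters, and the e-commerce values passed into it are all assumptions made up for this example.

```python
def build_prompt(industry, purpose, main, lookup, foreign_key, unmatched,
                 extra_rules=()):
    """Fill in the template's brackets and return the finished prompt text."""
    paragraphs = [
        f"You are a data engineer generating a realistic synthetic dataset "
        f"for {industry} and {purpose}.",
        "Generate 2 realistic datasets with the following requirements.",
        f"Create a {main['name']} table with {main['rows']} rows and columns: "
        f"{', '.join(main['columns'])}, plus any additional realistic columns "
        f"you think would be useful. {main['key']} is the primary key. "
        f"{foreign_key} is a foreign key that connects to the {lookup['name']} "
        f"table. Ensure that {unmatched} {foreign_key} values exist in the "
        f"{lookup['name']} table but do not appear in this table "
        f"(to simulate missing relationships).",
        f"Create a {lookup['name']} table with {lookup['rows']} rows and "
        f"columns: {', '.join(lookup['columns'])}, plus any additional "
        f"realistic columns. {lookup['key']} is the primary key and connects "
        f"to the first table.",
        "For both tables, include high variation across values, non-even "
        "category distributions, and realistic data patterns. All ID fields "
        "should be random numeric values only (no letters).",
        *extra_rules,  # any other requirements, constraints, or behavior rules
        "Return each table as a separate, downloadable CSV file.",
    ]
    return "\n".join(paragraphs)


# Hypothetical e-commerce example: every value below is made up.
prompt = build_prompt(
    industry="e-commerce",
    purpose="a customer retention analysis",
    main={
        "name": "orders",
        "rows": 5000,
        "columns": ["order_id", "customer_id", "order_date", "order_total"],
        "key": "order_id",
    },
    lookup={
        "name": "customers",
        "rows": 1000,
        "columns": ["customer_id", "signup_date", "state"],
        "key": "customer_id",
    },
    foreign_key="customer_id",
    unmatched=50,
    extra_rules=["Order dates should fall between 2023-01-01 and 2024-12-31."],
)
print(prompt)
```

Paste the printed text into your AI tool of choice, then tweak the extra rules until the output matches the project you have in mind.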
The bottom line
In 2026, the analysts who stand out are the ones who build real projects with real complexity, not the ones with a stack of certificates and a clean Kaggle dataset they didn’t make. Synthetic data lets you practice on your own terms, build a portfolio that actually impresses hiring managers, and develop the exact skills companies are looking for right now.
So stop waiting for the perfect dataset to land in your lap. Go build it yourself.
Bye BDEs 💅🏼
Jess Ramos 💕
⚡️If you’re new here:
💁🏽‍♀️ Who Am I?
I’m Jess Ramos, the founder of Big Data Energy and the creator of the BEST SQL course and community: Big SQL Energy⚡️. Check me out on socials: 🔗YouTube, 🔗LinkedIn, 🔗Instagram, and 🔗TikTok. And of course, subscribe to my 🔗newsletter here for all my upcoming lessons and updates, all for free!





Have y'all tried this yet? I've made a lot of content about it lately because it's such an underrated hack!