Strategy for Building in the Synthetic Data Space
If you are coming for the king, you better not miss. ~ Sam Altman
If I were building an AI startup in 2025, my focus would be on two markets: AI reasoning infra (which I'll probably cover in another article) or synthetic data generation. We have pretty much exhausted internet data, so all frontier models will now be trained on some variant of synthetic data, in both pre-training and post-training, and distilling these models into smaller ones will also require good data curation pipelines. Here are some strategic things synthetic data generation companies can do (from my own learning, working at a top AI startup in India and interfacing with a bunch of companies):
1. New market creation vs cost reduction - synthetic data should help companies unlock new capabilities in current models, or bridge the usability gap for particular use cases (think of the account aggregator model). Cost effectiveness comes from scale and infra management, which will have a cold start problem. Regulatory arbitrage will soon get eaten up by big tech. Big tech also being able to distill from bigger models helps them train much smaller models better (https://platform.openai.com/docs/guides/distillation); see the distillation sketch after this list.
2. Every time you can show better training with your data on a new model (closed or OSS) and cut cost by a multiple (not by 30-40%), you unlock the next tier of customers in the economy. Good synthetic data curation pipelines are designed to improve continuously with new model releases. Making it simple for potential customers to evaluate your data curation pipeline and pre-trained models (ready-to-use demos, easily accessible APIs, clearly presented performance benchmarks) is crucial for attracting these new customers, proving immediate value quickly, and driving broader market adoption; a minimal evaluation sketch follows this list.
3. Structurally lower-cost data generation that can't be easily copied by competitors. This has multiple dimensions. First, sell to top labs so you can develop tooling that makes creating a lot of these synthetic pipelines much more performant and easier to build. Second, partner with tooling providers and OSS tools - simulation engines, verification tools, agent builder platforms. Third, the GTM itself - acting as a BPO and selling an outcome, where the end model's quality is the outcome.
4. Most money is in inference, so own the loop - you can generate high-quality synthetic data, but combining that data with a tightly controlled reinforcement fine-tuning (RFT) pipeline is where the real strategic advantage lies, especially for action-oriented enterprise agents. If a client needs truly reliable, task-specific AI agents, they won't just need synthetic data; they need a way to continuously refine and improve models based on feedback and real-world performance. Owning the RFT pipeline in-house becomes a critical offering beyond just data generation (see the feedback-loop sketch after this list).
5. Winning vertical-specific use cases is paramount. All strategy starts with whether you want to build a horizontal platform for all kinds of data generation, or go deep in one domain/type/modality. As models get better, the value provided by the horizontal layer, as well as by generic human-in-the-loop work, will shrink, and what remains will need expert human labelers (imagine writing prompts for o5!). My bet is that synthetic data value stays in vertical-specific data, in extremely fast-growing industries that can massively benefit from it - ecommerce, robotics, vertical agents, video generation, code, biotech. Being vertical-specific helps build in-house domain expertise: hire experts, partner with industry specialists, and deeply understand vertical workflows, pain points, and data needs. This enables highly tailored and impactful synthetic data solutions.
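
Referenced in point 1: a minimal sketch of what a distillation-style data curation pipeline can look like, assuming the OpenAI Python SDK with an API key in the environment. The teacher model name, seed prompts, and quality filter are placeholders for whatever teacher and curation logic you actually use.

```python
# Sketch: distill a larger "teacher" model's outputs into a fine-tuning
# dataset for a smaller "student" model. Assumes the OpenAI Python SDK and
# OPENAI_API_KEY in the environment; model name and prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI()

seed_prompts = [
    "Summarize this support ticket in one sentence: ...",
    "Extract the merchant name and amount from: ...",
]

def teacher_answer(prompt: str) -> str:
    # Query the larger teacher model for a high-quality completion.
    resp = client.chat.completions.create(
        model="gpt-4o",  # teacher model (placeholder)
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

def passes_quality_filter(prompt: str, answer: str) -> bool:
    # Placeholder curation step: dedup, length checks, verifier models, etc.
    return bool(answer and answer.strip())

with open("distilled_train.jsonl", "w") as f:
    for prompt in seed_prompts:
        answer = teacher_answer(prompt)
        if not passes_quality_filter(prompt, answer):
            continue
        # Chat-format fine-tuning record for the smaller student model.
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")
```

The JSONL uses the chat-format records that OpenAI's fine-tuning endpoint accepts; in a real pipeline, the quality filter is where most of the value (dedup, verification, difficulty filtering) actually lives.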
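
Referenced in point 2: a toy sketch of the self-serve evaluation idea, comparing a baseline model against one trained on your data over the same held-out set. The benchmark, model IDs, and canned outputs are made up so the script runs as-is; in practice run_model would call your real inference endpoints.

```python
# Sketch: compare a baseline model against a model fine-tuned on your
# synthetic data, on the same held-out benchmark. Toy data and canned
# outputs stand in for real endpoints so the script runs end to end.
from statistics import mean

def load_benchmark() -> list[dict]:
    # Toy held-out set; replace with your real vertical benchmark.
    return [
        {"input": "2+2", "expected": "4"},
        {"input": "capital of France", "expected": "Paris"},
    ]

def run_model(model_id: str, example: dict) -> str:
    # Placeholder inference call; swap in your API / endpoint here.
    toy_outputs = {
        "baseline": {"2+2": "4", "capital of France": "paris"},
        "tuned-on-our-data": {"2+2": "4", "capital of France": "Paris"},
    }
    return toy_outputs[model_id][example["input"]]

def score(prediction: str, expected: str) -> float:
    # Exact match; a real harness would use graded or task-specific scoring.
    return float(prediction.strip() == expected.strip())

def evaluate(model_id: str, examples: list[dict]) -> float:
    return mean(score(run_model(model_id, ex), ex["expected"]) for ex in examples)

examples = load_benchmark()
baseline = evaluate("baseline", examples)
tuned = evaluate("tuned-on-our-data", examples)
print(f"baseline={baseline:.2f}  tuned={tuned:.2f}  delta={tuned - baseline:+.2f}")
```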
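
Referenced in point 4: a rough sketch of owning the loop, i.e. turning scored production traces into the next round of training data. The trace schema, verifier, and file format are assumptions rather than any particular vendor's RFT API, and a real pipeline would use the scalar reward directly instead of only filtering on it.

```python
# Sketch of a closed feedback loop for an action-oriented agent: score
# production traces with a task-specific verifier and promote the good ones
# into the next fine-tuning round. All names here are assumed placeholders.
import json
from dataclasses import dataclass

@dataclass
class Trace:
    prompt: str          # what the agent was asked to do
    actions: list[str]   # tool calls / steps the agent took
    outcome: str         # observed real-world result

def verifier(trace: Trace) -> float:
    # Task-specific reward: a checker, a simulator, or a human review queue.
    return 1.0 if "success" in trace.outcome else 0.0

def to_training_record(trace: Trace) -> dict:
    # Chat-style record; a full RFT setup would also keep the scalar reward.
    return {"messages": [
        {"role": "user", "content": trace.prompt},
        {"role": "assistant", "content": "\n".join(trace.actions)},
    ]}

def refresh_training_set(traces: list[Trace], threshold: float = 1.0) -> str:
    # Filter traces by reward and write the next round of training data.
    path = "rft_round_next.jsonl"
    with open(path, "w") as f:
        for trace in traces:
            if verifier(trace) >= threshold:
                f.write(json.dumps(to_training_record(trace)) + "\n")
    return path

# Example: one accepted trace, one rejected.
traces = [
    Trace("Refund order #123", ["lookup_order(123)", "issue_refund(123)"], "success"),
    Trace("Refund order #456", ["lookup_order(456)"], "timeout"),
]
print(refresh_training_set(traces))
```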