For many years, Vyas Sekar regularly called Muckai Girish, an old university friend, to talk through potential startup ideas and get Girish’s view on them. These discussions typically ended without further action. But when Sekar approached Girish with a synthetic data concept in early 2022, the conversation did not end when the call did.
Sekar and his Carnegie Mellon University colleague Giulia Fanti had been developing synthetic data solutions aimed at the reproducibility crisis, the widespread difficulty researchers in academia face in replicating published results. While Sekar recognised a pressing need within academic circles, Girish realised that similar issues were affecting his enterprise clients. Conversations with various businesses further validated their thesis.
Girish, now the startup's CEO, said it became clear that there was a substantial opportunity. Over the following months, the team spoke with investors and industry contacts and concluded that the problem was significant enough to dedicate their careers to solving.
This collaboration led to the creation of Rockfish, a startup that uses generative AI to produce synthetic data for operational workflows, helping businesses overcome data silos. Rockfish integrates with cloud data services, including those from AWS and Azure, and helps companies manage their data in line with organisational policies and use cases.
Synthetic data has gained considerable traction within the AI industry, and interest was already growing when Rockfish was founded in June 2022. Girish noted that Rockfish set out to build a product distinct from its competitors', one that enterprises would treat as an essential tool rather than an occasional resource.
That commitment led to a product designed to ingest data continuously, with a primary focus on operational data in areas such as financial transactions, cybersecurity, and supply chains. Data in these areas changes constantly, and it is this focus on continuous ingestion that sets Rockfish apart from its competition.
Rockfish currently works with several enterprise clients, including the streaming analytics platform Conviva, and with government agencies including the U.S. Army and the U.S. Department of Defense.
Rockfish recently announced a $4 million seed funding round led by Emergent Ventures, with participation from Foster Ventures, TEN13, and Dallas VC, bringing its total funding to approximately $6 million.
Anupam Rastogi, a managing partner at Emergent Ventures, said he had been following Sekar's work closely even before Rockfish was founded. He said the investment decision was driven by the quality of the team, the market opportunity, and the product, in that order. Rockfish’s focus on enterprise solutions made it a more suitable investment prospect than other contenders in the industry.
Rastogi pointed out that the team includes highly qualified data scientists, several of whom hold PhDs, which reflects the technical complexity of the field. That expertise is crucial, and team members have contributed significantly to foundational advances in the industry.
While Rockfish expects its distinct focus to give it a competitive advantage, the synthetic data market is likely to become increasingly crowded. Numerous AI companies are gravitating towards synthetic data solutions, as many believe the supply of traditional AI training data has reached its limits.
Several startups are already vying for market share, including Tonic AI, which has secured over $45 million in funding; Mostly AI, with $31 million in venture backing; and Hazy, which raised $14.5 million before its acquisition by SAS in 2024, among others.
Girish indicated that Rockfish plans to enhance its synthetic data methodology by integrating additional modelling approaches, such as state space models, which describe how a system evolves over time through a set of state variables. The company also plans to keep expanding its feature set.
Girish remarked that generating synthetic data is not simply a matter of sourcing random internet data, since that approach does not ensure effective outcomes. The goal is to produce relevant and realistic data tailored to enterprise needs, and to keep delivering value continuously.