Data Curation Networks: The Unsung Hero of the AI Stack

Data has been touted as digital oil for decades. The comparison is more apt than ever with the proliferation of AI, as companies need large volumes of high-quality data to train their models. Data comes in many forms and is acquired in many ways, but capturing high-quality data is expensive and inefficient. What if we could connect AI companies to networks of users who generate this data at low cost? Such networks could reduce the cost of capture by going directly to the source: existing hardware owned by users, namely edge devices such as mobile phones and laptops.

DePIN networks have already proven the efficacy of crypto incentives for bootstrapping the supply side of digital commodities. Fostering the demand side has proven much more difficult: even the most prominent DePIN projects have relatively insignificant usage. Even amid an AI craze that has increased the value and scarcity of the digital commodities offered in DePIN marketplaces, such as compute and storage, adoption has been underwhelming. Many DePIN networks still face hurdles, particularly an uphill march against entrenched hyperscalers and the difficulty of delivering a seamless end-user experience, but there is space for a new subset of DePINs with a much faster road to capturing demand. While compute, storage, and model training/hosting networks have proliferated with the rise of AI, data seems to be the most underserved (and yet perhaps most important) portion of the AI stack. Data Curation Networks (DCNs) can help capture unique user data to build rich model training datasets. DCNs hold a systematic advantage over centralized data collection because they can coordinate individuals to collect unique datasets that otherwise wouldn’t be possible.

Data marketplaces are fundamentally different from traditional DePIN marketplaces. Generally, data creation happens at the user level, meaning data exists on edge devices. These edge devices are effectively nodes, which enables efficient decentralized data curation. This separates data curation from the commodity markets of traditional DePINs: centralized entities don’t have a monopoly on economies of scale in data curation. By contrast, DePIN networks have struggled to compete against centralized corporations that have large-scale integrated services and robust marketing and sales teams. Often, a DePIN network’s only significant differentiator is lower cost.

In the case of Data Curation Networks, the process and strategy of B2B data sales is much less daunting. Networks focused on unique datasets can generate significant revenue from a focused target market, whereas DePINs tend to require widespread adoption across a much larger audience. Another point of friction with many DePINs is the requirement of unique physical infrastructure, which adds a hurdle to adoption. In contrast, existing physical infrastructure can and will be used for data curation. Steady improvements in edge hardware have created the rails for a powerful distributed edge computing network ripe for this type of work. Computers, smartphones, and other smart accessories are equipped with high-quality chips and are capable of producing meaningful work. DCNs can let users collect and supply data with minimal friction via software integrated with existing hardware.

Why are DCNs valuable? As AI proliferates and takes many forms, ranging from general-use LLMs to personal agents to business tools, we will run into a data wall. Organic human data will be a major constraint on AI development, and thus will increase drastically in value. For AIs to align with human needs, only human data can serve as a source of truth. This organic data is also necessary to unlock the scalable power of synthetic data that can be used to supplement training. DCNs can harness edge computing to capture, verify, safeguard, and monetize this indispensable data, helping prevent data from becoming a blocker in the AI pipeline.

DCNs also provide a way to capture a much more diverse world of data; the distributed nature of these networks can allow data capture anywhere that an edge device exists. This will become increasingly valuable as AI needs to gain insights from differing geographic locations and demographics.

Inspiration for the thinking around DCNs came from a protocol called Natix, which enables everyday drivers to earn passive income by turning their smartphones into data collectors. The smartphones capture geospatial data and imagery that serve as training data for autonomous driving, navigation, and logistics use cases. Precedent for the B2B sale of this kind of data already exists and has proven lucrative; Landsat’s geospatial data has been valued at over $3B, and ESRI, the company behind ArcGIS, has revenue over $1B. In addition to a clear path to revenue, Natix’s approach is distinct from traditional DePINs and represents a paradigm shift in how datasets are curated. The differences boil down to the following points, which serve as the criteria for defining a data curation network:

  1. Passive data curation: Users simply enable the driving app on their smartphone, and data is collected with little to no additional work from the user.
  2. Resistance to centralized economies of scale: Driving data is most easily acquired through distributed coordination, and the infrastructure for it already exists in the form of millions of people who drive every day.
  3. Value in diverse data curation: Mapping underserved areas is financially inefficient for centralized mappers, but Natix captures these areas through crypto incentives and network effects. Driving information differs drastically from region to region, so it is valuable to capture geographically diverse data.
  4. Utilization of existing hardware: While not necessarily a requirement for a DCN, Natix’s use of smartphones removes the need for unique hardware, typically a huge point of friction, which has afforded them rapid user growth.

While Natix is the inspiration for DCNs, many projects will follow the DCN criteria to capitalize on edge-computing improvements and the high prices AI datasets command. Through distributed coordination, DCNs can create unique and valuable datasets that are in high demand but otherwise impossible to generate. The era of AI ushers in motivated buyers for unique, quality data that can be used to create domain-specific expert models. Crypto incentives coordinate the collection of this data, and as long as the collection experience is relatively frictionless, rich datasets can be curated, enabling model creation at lower cost and higher quality than is otherwise possible. The potential of DCNs to create legitimate revenue-producing marketplaces is vast, as projects like Vana, Cudis, Hivemapper, and Grass are demonstrating.

Thanks to Matt Burke, Sami Kassab, and Anna Kazlauskas for sharing thoughts that were formative to this piece.