Reddit data DAO and everything to know about Gen AI model training


  en.cryptonomist.ch 20 July 2024 11:19, UTC

The Cryptonomist interviewed Anna Kazlauskas, CEO and co-founder of Vana, whose Reddit Data DAO saw 140k users sign up with verified Reddit accounts in its first week. Anna is now working with developers to build Data DAOs for other platforms, like LinkedIn and ChatGPT.

In addition to DAOs, Vana offers other ways for users to pool their data into datasets that can then be used for GenAI model training, such as creating portraits or avatars.

Beyond Vana’s own work, we talked with Anna about the growth of the decentralized AI space, as platforms help people use and monetize their data for new applications.

Can you provide an overview of Vana and its mission in decentralized AI space?

Vana is a user-owned AI platform powered by user-owned data. Our mission is for users to own their data and the value it creates through AI models. There’s a growing need for more training data to improve AI model performance, as ultimately AI models are only as good as their data.

For example, LLaMA 3 is trained on about 15 trillion tokens, which is roughly the amount of data available on the public internet. Companies are now trying to acquire more data, sometimes paying hundreds of millions of dollars for it. Major tech platforms are hoarding valuable user data and building new technologies without considering user permissions, which is holding back innovation.

At Vana, we’re liberating data from these walled gardens by putting it under user control. We allow users to directly contribute to AI models, choose how their data is used, and how AI is used. We believe we can actually outperform leading models if we can access the very best data–beating the performance of models like GPT-6 by accessing data only available directly from users. Vana is architected as a layer 1 blockchain designed from the ground up for private, user-owned data.

The Reddit Data DAO saw 140k users sign up in its first week. What do you think drove this rapid adoption, and what lessons did you learn from this launch?

The Reddit Data DAO was an incredible success from an adoption perspective, with over 140k users signing up in the first week. This level of adoption is unusual for DAOs–it is now the largest data DAO in history.

One driver of the rapid adoption is that much of the story had already been told: users are becoming more and more aware of the value of their data through press coverage of data sales. Realizing that Reddit is selling your data for $200M or that Apple is buying up data for $50M makes you much more aware of its value.

There’s also a strong appetite for user-owned products built in web3 that move beyond familiar DeFi products to a new frontier of ownership. We’re seeing this trend in projects like Farcaster, DePIN networks, and data DAOs built on Vana, which represent a new wave of user-owned products.

One important lesson was the need for proof of contribution requirements. Over a million people tried to join the Reddit Data DAO, but many didn’t meet the criteria of having a Reddit account that’s been around for a certain time and has a minimum amount of data. This highlights the importance of having mechanisms to ensure quality contributions.
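As a rough illustration of the kind of eligibility check described above, here is a minimal Python sketch. The interview names only two criteria (account age and a minimum amount of data); the threshold values, the `RedditExport` type, and the function name are hypothetical and are not taken from Vana or the Reddit Data DAO.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds: the interview does not state the real values.
MIN_ACCOUNT_AGE_DAYS = 365
MIN_ITEMS = 50


@dataclass
class RedditExport:
    """Illustrative summary of a user's exported Reddit data."""
    account_created: date
    item_count: int  # posts plus comments found in the export


def meets_contribution_criteria(export: RedditExport, today: date) -> bool:
    """Return True if the account is old enough and holds enough data."""
    account_age_days = (today - export.account_created).days
    return (
        account_age_days >= MIN_ACCOUNT_AGE_DAYS
        and export.item_count >= MIN_ITEMS
    )
```

A real proof-of-contribution mechanism would also need to verify that the export is authentic and belongs to the person submitting it, which this sketch does not attempt.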

You mentioned plans to create Data DAOs for platforms like LinkedIn and ChatGPT. What unique challenges and opportunities do you see in expanding to these platforms?

Vana is a peer-to-peer network for user-owned data, and builders have created various data DAOs like the Reddit Data DAO, LinkedIn Data DAO, and ChatGPT Data DAO.

These different data sources are incredibly valuable for training AI models, but they’re currently locked away in walled gardens. Each of these platforms can be tricky to get data out of, but it’s always possible because data regulation gives users the right to access their own data.

How does Vana empower users to monetize their data, and what are some examples of how users have benefited from this?

Our goal is to help users monetize and protect their data simultaneously. For example, with the Reddit Data DAO, they’re now training a user-owned model (mostly focused on shitposting at this stage, but it’s a start). Users get paid every time the model is used, creating an economic incentive for joint ownership of the model.

And user data stays fully private – rather than being sold, the data is only “rented”: the underlying data never leaves the secure environment.
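To make the pay-per-use idea above concrete, here is a minimal Python sketch of splitting a single usage fee among contributors. The interview says only that users get paid every time the model is used; the pro-rata split by contribution share, the function name, and the rounding rule are assumptions for illustration, not Vana’s actual mechanism.

```python
def split_usage_fee(fee_cents: int, shares: dict[str, int]) -> dict[str, int]:
    """Split one usage fee pro rata by contribution share (assumed rule).

    Amounts are integer cents so that the payouts always sum to the fee.
    """
    if fee_cents < 0 or any(s < 0 for s in shares.values()):
        raise ValueError("fee and shares must be non-negative")
    total = sum(shares.values())
    if total == 0:
        raise ValueError("at least one contributor must hold a share")

    # Floor each contributor's pro-rata amount.
    payouts = {user: fee_cents * s // total for user, s in shares.items()}

    # Hand out the leftover cents one at a time, largest shares first
    # (ties broken by name so the result is deterministic).
    remainder = fee_cents - sum(payouts.values())
    ranked = sorted(shares, key=lambda user: (-shares[user], user))
    for user in ranked[:remainder]:
        payouts[user] += 1
    return payouts
```

For example, a 100-cent fee shared equally by three hypothetical contributors `a`, `b`, and `c` comes out as 34, 33, and 33 cents.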

With the growing concern around data privacy, how does Vana ensure that user data is secure and used ethically within Data DAOs?

Data privacy has shifted from being just an ideological or preference question to an economic one. If someone has your data, they can potentially create an AI version of you that is economically valuable, earning revenue and potentially competing with you. That’s why privacy is so important and core to Vana.

We invented a concept called “non-custodial data”, which is similar to a non-custodial wallet but for your personal data. It keeps your data under your full control, permissioned by your private key. This allows your data to be portable across applications and adds a native financial layer on top, enabling things like data DAOs to be built.

How do the datasets created through Vana’s Data DAOs enhance the training of generative AI models, and what advantages do they offer over traditional datasets?

Typically, AI models are trained with data scraped from the public internet – data that’s available without logging in anywhere. But if you think about it from the perspective of teaching a child about the world, you wouldn’t want them just wandering the public internet randomly. You’d want to give them high-quality information that might not be publicly available – things like high-quality writing, thought processes, or messages. AI is primarily trained on public data, but it really needs private data to push the frontiers. This is what data DAOs enable: users contributing their private data to create user-owned AI.

We believe AI should be created more like open source software, by a community. Our goal is to give researchers access to the best datasets that are currently held captive inside walled gardens to push the frontiers of AI performance.

What trends do you foresee in the decentralized AI space over the next 5-10 years, and how is Vana positioning itself to lead in this evolving landscape?

The decentralized AI space has really accelerated over the past year. For example, at EthCC this year, there was a decentralized AI event almost every day, compared to none last year. People are figuring out how to apply sovereign technologies that have worked well for finance to the AI space. At Vana, we believe that the core foundation of all this is data. To build user-owned AI and sovereign AI, you need user-owned data, so our focus is on that data piece.

In the next 5-10 years, I’m excited about a few milestones: 1) A user-owned foundation model collectively owned by 100 million people. 2) More autonomous AI agents that can earn on their own, and ensuring those agents are truly owned by the users who contributed to training them. 3) As AI plays a more and more important economic role, ensuring that power is widely distributed from both a technical and social perspective.

Can you share more about your collaboration with developers to build Data DAOs? What are some of the innovative projects currently in the pipeline?

Vana is a permissionless network, so anyone can build a data DAO. It is a layer 1 blockchain designed from the ground up for private, user-owned data. There are over 100 data DAOs deployed on the Satori testnet today. Many of the builders are early participants in the Bittensor ecosystem who deeply understand the intersection of crypto and AI. Some notable projects include the Twitter Data DAO, LinkedIn Data DAO, and GitHub Data DAO. We’re also partnering with projects in the ZK space and DAO tooling space to make data DAOs even easier to create and manage.

What ethical considerations are most pressing in the development of decentralized AI, and how does Vana address these issues?

I think one of the biggest questions in AI today is who should own models and decide what data goes into them. As we start to rely on AI more and more for information, it becomes our source of truth. Whoever decides what goes into the AI is essentially deciding the truth. It’s scary to have a single entity controlling this. Our view at Vana is that the community, not a single company, should make these decisions.

One other question that comes up in decentralized AI is: if the AI is fully decentralized, then what if the AI goes rogue and there’s no off button? The way we approach this at Vana is that AI models are ultimately owned by the users who have contributed to them, so they always stay in full control.

What advice would you give to aspiring entrepreneurs looking to enter the decentralized AI space, based on your experiences with Vana and Data DAOs?

It’s a great time to start building in the decentralized AI space. There is a lot of opportunity to apply some of the cryptoeconomic primitives that have worked well for DeFi to the emerging category of decentralized data and AI. I’d also recommend spending some time diving into the non-crypto, open source AI space to learn about some of the approaches people are taking outside a crypto context. I would get hands-on with some of the existing projects to see what sort of primitives are available to build with, including trying out starting a data DAO on Vana.
