Why We Should Create a Collaborative AI-Powered Blockchain Economy That Benefits Everyone, with Sean Ren @ Sahara Labs (Audio)
Crypto Hipster
462
00:31:49 · 18.03 MB


Sean (Xiang) Ren, CEO and Co-Founder of Sahara AI

Sean is the CEO and Co-Founder of Sahara AI, a decentralized AI blockchain platform for a collaborative economy. Backed by top investors in AI and crypto, including Binance Labs, Pantera Capital, Polychain Capital, Sequoia Capital, Samsung Next, Matrix Partners, and many more, Sahara AI has raised over $40 million to advance decentralized AI. Today, Sahara AI is trusted by 35+ leading tech innovators and research institutions, such as Microsoft, Amazon, MIT, Character AI, and Snapchat. Additionally, Sean is an Associate Professor in Computer Science and the Andrew and Erna Viterbi Early Career Chair at the University of Southern California, where he is the Principal Investigator (PI) of the Intelligence and Knowledge Discovery (INK) Research Lab. At the Allen Institute for AI, Sean contributes to machine common sense research. Previously, Sean was a Data Science Advisor at Snapchat. He completed his PhD in computer science at the University of Illinois Urbana-Champaign and was a postdoctoral researcher at Stanford University. Sean has received many awards recognizing his research and innovation in the AI space, including the WSDM Test of Time Paper Award, Samsung AI Researcher of 2023, MIT TR Innovators Under 35, Forbes 30 Under 30, and more.

[00:00:02] Hello, everybody, and welcome to the Crypto Hipster Podcast. This is your host, Jamil Hasan, the Crypto Hipster, where I interview founders, entrepreneurs, executives, thought leaders, and amazing people from all around the world of crypto and blockchain. Today, I have another amazing guest, and I've been looking forward to this interview. He is the co-founder and CEO of Sahara AI. His name is Sean Ren. Sean, welcome to the show.

[00:00:31] Hey, thanks for having me, Jamil. Glad to be here. You're very welcome. Thanks for joining me today. So let's kick things off. I'll ask you first: what is your background, and is it a logical background for what you're doing now? Yeah, I would say my background is actually a pretty unique one within the whole crypto space. I was an associate professor in the Computer Science Department at USC before I started Sahara.

[00:00:59] I'm still a professor there, and I've been working with 20-plus PhD students in the area of AI and what we call natural language processing for over 12 years. I mostly do research on model safety and security, and on how to make these models cheaper to customize for different use cases.

[00:01:20] A lot of that work involves what we call distributed architectures, where we can train a model across many user devices without pooling everything into one central server.
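
To make that concrete, here is a minimal sketch of one round of this kind of distributed training, in the spirit of federated averaging. It is an illustration under assumed names, not Sahara's or USC's actual system: each device runs a few gradient steps on its own data, and only the updated weights are averaged centrally. The raw data never leaves the device.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.05, epochs=1):
    """Run a few training steps on one user's device.

    Only the resulting weights leave the device, never the raw data.
    This toy example fits a linear model with plain gradient descent.
    """
    w = global_weights.copy()
    X, y = local_data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(global_weights, device_datasets):
    """One round of federated averaging: each device trains locally,
    the coordinator averages the returned weights."""
    updates = [local_update(global_weights, d) for d in device_datasets]
    return np.mean(updates, axis=0)

# Toy run: three "devices", each holding private data that never leaves them.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((X, y))

w = np.zeros(2)
for _ in range(100):
    w = federated_average(w, devices)
print(w)  # approaches [2.0, -1.0] without pooling any raw data
```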

[00:01:32] All of this is quite relevant to the technical challenges Sahara AI is tackling, because we've been thinking about how to protect user data without storing every single piece of it in one centralized place.

[00:01:50] We're also thinking about how to build models without exposing too much private information to the external developers involved in building them, things like that. So I would say there's a very natural connection between what I've been researching and what I'm doing today.

[00:02:14] But just to add on, I've also spent a lot of my professorship career working with industry. I worked with Snapchat for almost a year and a half as a consultant to their data science team, thinking about how to take some of the latest AI technologies into actual deployed products within Snapchat. I helped them boost user engagement, understand user personas better, and things like that.

[00:02:44] So I would say that, throughout the time before I started Sahara, I got both the frontier view of where these AI technologies are heading and the view from the actual battlefield where they get deployed, including the constraints that determine whether they get used in practical scenarios. So, yeah, I appreciate that kind of background a lot.

[00:03:13] Awesome. Awesome. So the answer is yes, it is logical. Excellent. It is. Yeah. Awesome. So, Sahara AI, right? What's it all about, and how does it transform AI asset development? Yeah. I can give a little more context, but a simple way to put it is that Sahara AI is building a sort of copyright system for AI assets, right?

[00:03:42] We're talking about data sets and models, which are the two main types of assets people are exchanging, trading, and building around in the AI era, right?

[00:03:56] And the problem of building a copyright system comes from this premise: back in 2022, I was working with this model called GPT-3, an earlier generation of ChatGPT, which everyone knows, right?

[00:04:12] What I observed is that GPT-3 was way better than the previous generation of models, meaning it could potentially start replacing people's jobs, parts of their business processes, and taking over their monetization opportunities, right? That's what I was thinking about back then. So I was talking to my co-founder, Tyler: assuming this model is going to start taking over our jobs, what is the burning problem?

[00:04:41] And apparently, ownership becomes the keyword, right? You've been using ChatGPT every day, uploading some of your data or information and giving it feedback. So over time, it starts learning your thinking process and your know-how. That's why these models become better and better, to the extent that they can be better than you at doing your job.

[00:05:07] And I think this trend is irreversible; it's not going to turn back. So the question now is: when you are giving out your personal knowledge, information, and interactions with this AI, how do you continue to claim ownership of that data? How do you continue to claim ownership of the model that was built using that data, right?

[00:05:35] You can see this as a copyright problem. So the starting thesis of Sahara AI is to use blockchain as the underlying technology to create what we call provenance infrastructure, which records who the contributors and owners of these AI assets are.

[00:06:01] And this provenance is like an open ledger, basically. Everyone can see and audit who owns what and who contributed what. Then you can use that to create new business models like revenue sharing among whoever provides the compute (the GPUs), whoever provides the recipe, the algorithm for training the AI, and whoever provides the data sets used to train the model. Yeah, things like that. Speaking of ChatGPT, I saw a video recently. It had a picture.
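
To illustrate the open-ledger idea, here is a minimal sketch of what one entry in such a provenance record might look like. The schema is invented for illustration, not Sahara AI's actual data model; the point is that entries are hash-chained, so anyone can recompute the digests and audit who contributed what.

```python
import hashlib, json, time
from dataclasses import dataclass, field, asdict

@dataclass
class ProvenanceRecord:
    """One entry in a toy provenance ledger for an AI asset.

    Illustrative only -- these field names are assumptions, not Sahara's schema.
    """
    asset_id: str          # e.g. a dataset or model identifier
    asset_type: str        # "dataset" | "model"
    contributor: str       # who supplied data, compute, or the training recipe
    role: str              # "data" | "compute" | "recipe"
    revenue_share: float   # fraction of downstream revenue this party earns
    prev_hash: str         # hash of the previous record: tamper-evidence
    timestamp: float = field(default_factory=time.time)

    def digest(self) -> str:
        return hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()

# Build a tiny chain: anyone can recompute the hashes to audit it.
genesis = ProvenanceRecord("model-42", "model", "alice", "data", 0.5,
                           prev_hash="0" * 64)
second = ProvenanceRecord("model-42", "model", "bob", "compute", 0.3,
                          prev_hash=genesis.digest())
third = ProvenanceRecord("model-42", "model", "carol", "recipe", 0.2,
                         prev_hash=second.digest())

assert second.prev_hash == genesis.digest()  # audit step: chain is intact
```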

[00:06:30] It showed the ChatGPT guy fishing in a pond with a fishing pole, and then it showed the DeepSeek guy fishing out of the ChatGPT guy's bucket. Right. That's a funny one. Yeah. I mean, I think this is going to be an endless battle between all these big AI players, right?

[00:06:57] Every so often a new game changer comes along. Earlier on, it was Llama from Meta, who basically put out a free model that everyone can use, even for commercial cases. That clearly changed the game for the open AI ecosystem. And now DeepSeek has come out.

[00:07:20] It changed the game again, because not only did they release a free model, they also told you the entire recipe for how the model was built. That's basically saying, hey, I'm giving you a $6 million recipe for free so everyone can try to rebuild it. And I think the implication is even more significant: the valuations of these big AI companies, like Anthropic, for example, need to be reassessed.

[00:07:47] Because now I'm telling you a model as good as GPT-4 is out there for free. Then, automatically, the valuations of these companies should drop to, I don't know, 10% of what they were.

[00:08:03] And I think this is why, if you look at the media, all these companies' CEOs have been scrambling to explain the difference, what it is that makes them worth that kind of valuation. We'll see. We'll see.

[00:08:25] I was thinking, you know, I question whether that big learning model is the best model, or whether it should be something like federated learning based on much smaller data sets. What do you think? I think this is a great point, because it potentially represents very divergent thinking across society, right?

[00:08:53] On one hand, federated learning has been around for decades, and it's had a very hard time being adopted and shining in Web2 companies. For example, at Google, we know there was a team dedicated to federated learning a while back, but they just didn't get too far.

[00:09:15] I think the point is that users weren't educated enough about the privacy and security of their data yet, right? When you show them an app and they just need to click "agree" on a consent form to use the app for free, they tend to do it.

[00:09:34] And they didn't think about all the implications behind those consents: that your data is basically going to be used by the apps, and then by the AI models, for whatever optimizations they want. And this is aggregated across millions of users' data. So it's kind of scary, but it doesn't really feel that way to us. That's why, I think, federated learning, and just to give the audience a primer, right?

[00:10:02] What it's trying to do is let the model in the central server be updated in a sort of privacy-protected way, without users uploading their data from their local devices to that server. It's a very nice, advanced technology in the research community, and we're still making a lot of progress every year, but it just doesn't have as many use cases as we hoped, because the AI companies,

[00:10:29] they don't think it's necessary to deploy such a complicated and sophisticated process. They can just pull the user data to the server and run the traditional kind of learning system, right? What you'd call the big-learning kind of process. Right. So let's talk about data privacy and security, right? What's the role of decentralized AI, right?

[00:10:56] Including, what are the data privacy and security challenges and concerns in Web3? I think all these dots connect to what I've said so far, right? Claiming user ownership over data and models is basically coupled with the data security and privacy problem.

[00:11:20] Because if you leak your personal data to others, say it's even public on some black market where people can exchange and purchase it, then there's not enough value left in your personal knowledge and data, right? So in order to have anyone, say, use your data sets by licensing them for a certain fee or through a revenue-sharing mechanism,

[00:11:49] you have to make sure your data is securely protected somewhere, right? And the idea we are advocating these days is to make sure all of the AI computation happens in a trusted execution environment, or TEE. This is a hardware-supported encrypted environment for data storage and data computation.

[00:12:17] It's available on the latest generations of Intel CPU chips and NVIDIA GPU chips like the H100, for example. So the nice thing is, anytime you decide to use your data to power your own AI, or to power someone else's AI,

[00:12:37] you can send your data in a secure way to these TEE environments on any cloud server with an H100 or other TEE-capable chips. And then anything that happens within the TEE environment is guaranteed to be secure, with the caveat that you have to trust these chips, right? Let's say you don't trust NVIDIA; then this is not the solution you want.

[00:13:02] For that, we'd need to talk about even more advanced technologies like fully homomorphic encryption, which is not going to happen today; it's probably going to take another few years to mature, right? But yeah, I think the TEE is our best bet today for data security, encrypted computation, and even verification of the computation, so other people can audit what's going on.

[00:13:27] When you say, hey, I used this model to make an inference on your API call: how can I trust that? Are you really using that model? Are you really using my data? Did you put any poisoned data into the process to create some sort of backdoor? All of these things can now be verified in the TEE environment as well. So that's, overall, the kind of security and privacy protection we can achieve today.
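
The sketch below mimics the shape of that verification flow. Real TEEs (Intel SGX/TDX, NVIDIA confidential computing on the H100) do this with vendor-issued attestation keys and quote-verification services; every name and key here is a stand-in for illustration, not a real TEE API.

```python
import hashlib
import hmac

# Toy stand-in for a hardware attestation key burned into the chip.
# In a real TEE this key never leaves the hardware and the vendor
# vouches for it; trusting the scheme means trusting the chip maker.
_HARDWARE_KEY = b"pretend-this-lives-in-silicon"

def enclave_infer(model_bytes: bytes, input_bytes: bytes) -> tuple[bytes, dict]:
    """Pretend inference inside the enclave. Alongside the output, the
    enclave signs a 'quote': hashes of exactly what code and data ran."""
    output = b"model output for: " + input_bytes          # stand-in compute
    quote = {
        "model_hash": hashlib.sha256(model_bytes).hexdigest(),
        "input_hash": hashlib.sha256(input_bytes).hexdigest(),
        "output_hash": hashlib.sha256(output).hexdigest(),
    }
    msg = "|".join(sorted(quote.values())).encode()
    quote["signature"] = hmac.new(_HARDWARE_KEY, msg, hashlib.sha256).hexdigest()
    return output, quote

def client_verify(quote: dict, expected_model_hash: str) -> bool:
    """The caller checks that (1) the quote is signed by hardware it
    trusts, and (2) the model that ran is the one advertised."""
    msg = "|".join(sorted(
        v for k, v in quote.items() if k != "signature"
    )).encode()
    expected_sig = hmac.new(_HARDWARE_KEY, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(quote["signature"], expected_sig)
            and quote["model_hash"] == expected_model_hash)

model = b"weights..."
out, q = enclave_infer(model, b"What is provenance?")
assert client_verify(q, hashlib.sha256(model).hexdigest())  # audit passes
```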

[00:13:59] I don't ever like to prove my guests wrong. Okay, I don't like to, but there's no time like the present, right? So why don't we talk about fully homomorphic encryption and what the challenges are with it in this digital age, starting now? I mean, we're making a lot of progress, right?

[00:14:22] There are billions of dollars of investment, added up together, going into this FHE area. Zama is a prominent example, right? But what we've learned from their research results so far is that it only works on simple models. I don't think it goes much beyond linear models, and maybe slightly more complex non-linear models like a classifier.

[00:14:49] But they never get to the complexity of transformer models, which are basically what people use every day when they talk to these GPT or DeepSeek models. And the thing is, designing an FHE algorithm doesn't scale linearly with the complexity of the model; it's actually non-linear, right?

[00:15:18] So you go from a linear model to a transformer, and the model complexity goes up by, let's say, some multiple. But the complexity of designing the FHE algorithm could be, I don't know, a million times harder. And that basically translates into the time cost when you run this FHE process with transformers.

[00:15:41] So think about it: if you look at today's best inference endpoints for these models, you get maybe 50 tokens per second in response to your questions when you're calling a Llama 3 API. But if you deploy FHE on that process, you might wait 10 seconds for one word to come out of the API endpoint.
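
A quick back-of-envelope comparison of those two figures, treating both as the ballpark numbers quoted here rather than measured benchmarks:

```python
# Ballpark figures quoted in the conversation, not measured benchmarks.
plain_tokens_per_sec = 50        # a fast hosted Llama-3-class endpoint
fhe_secs_per_token = 10          # "10 seconds for one word" under FHE

fhe_tokens_per_sec = 1 / fhe_secs_per_token           # 0.1 tokens/sec
slowdown = plain_tokens_per_sec / fhe_tokens_per_sec
print(f"~{slowdown:.0f}x slower under FHE")           # ~500x

# For a 200-token answer: 4 seconds plaintext vs ~33 minutes under FHE.
print(200 / plain_tokens_per_sec, "seconds plaintext")
print(200 * fhe_secs_per_token / 60, "minutes under FHE")
```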

[00:16:10] That's an unacceptable user experience. So that's why I think it's too far away. Yeah. Okay. Thank you. That makes a lot of sense. So let's talk about something a little more mainstream, right? Right now we have all these huge firms; you mentioned NVIDIA, but they're not the biggest. There are other huge, massive centralized data monopolies, right? How can decentralized infrastructure reduce our dependency on them?

[00:16:39] How do we do a Teddy Roosevelt kind of trust-busting, but on data monopolies? I mean, that's a hard question. I honestly don't think we have a good solution for turning these centralized players into more decentralized ones. So our thinking, especially representing Sahara, is more that we have to do this from scratch, in a totally different manner, right? Let me break this down.

[00:17:09] I think the operating model for the centralized providers is that they control the revenue-making applications. Think about it: Google has search, Meta has all the social, and so on, right? So they control the revenue; then they can use the revenue to control the compute, the GPUs, the data centers.

[00:17:34] They can also hire the best talent, and then they get the better AI recipes because they have the talent, and so on and so forth. They can keep compounding this effect. So they get into a flywheel where they can always stay on top of the game. Even with disruptions like DeepSeek, I'm sure they will catch up pretty quickly, in a few months, and still stay on top of the game,

[00:18:02] because of all the accumulated resources and the revenue they have to make it happen. For DeepSeek, I'm actually a little bit worried: if they're giving DeepSeek out for free, or at very low cost, they're going to keep losing money. It's not sustainable, even though it's a very good model. So again, that's why, before any of these companies dares to offer a free lunch, they have to have a very solid revenue foundation.

[00:18:30] That's why the central providers, sorry, the major players in AI, are sort of stably entrenched over there. And now, if you want to say, hey, I want to force Google or Meta to give out the user data and be more transparent about how the data is used in AI, that's really hard to do in a bottom-up manner.

[00:18:58] So our best bet there is probably a top-down approach, where regulations and legislation are applied to them. That sort of seemed to be happening in the Biden era, but I don't think it's still going to be the case under Trump; it seems like the direction is not going to be stricter compliance across the whole AI landscape.

[00:19:24] So I think this just gives the big companies more room to operate. By contrast, we think the way to break this down is really to reconstruct the market. You have to educate people about the importance of their personal data and make sure they don't upload and leak their most valuable data assets to these providers.

[00:19:49] Rather, they use, for example on the Sahara platform, what we call a vault, which is basically a data container on your local device, like your laptop, with encryption and other mechanisms to protect your data. And you use the vault to communicate with the TEE environment, so everything is end-to-end protected, right? Data is sent to the TEE environment to do the AI computation, where it is also protected.
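
Here is a minimal sketch of that end-to-end pattern: data is encrypted in a local vault before it leaves the device and decrypted only inside the enclave. The class and function names are illustrative, not Sahara's API, and in a real deployment the key would be bootstrapped from the TEE's attestation rather than shared directly.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# In a real flow the key exchange would be bootstrapped from the TEE's
# attestation: the vault encrypts to a key that only exists inside the
# enclave. A shared symmetric key stands in for that here.
enclave_key = Fernet.generate_key()

class Vault:
    """Toy local 'vault': data is encrypted before it ever leaves the device."""
    def __init__(self, enclave_key: bytes):
        self._fernet = Fernet(enclave_key)
        self._records: list[bytes] = []

    def store(self, plaintext: bytes) -> None:
        self._records.append(plaintext)          # stays on the laptop

    def export_for_tee(self) -> list[bytes]:
        # Only ciphertext ever crosses the network.
        return [self._fernet.encrypt(r) for r in self._records]

def tee_compute(ciphertexts: list[bytes], enclave_key: bytes) -> int:
    """Pretend enclave-side job: decrypt *inside* the TEE and compute.
    Plaintext never exists outside the trusted boundary."""
    fernet = Fernet(enclave_key)
    return sum(len(fernet.decrypt(c)) for c in ciphertexts)  # e.g. byte count

vault = Vault(enclave_key)
vault.store(b"my private chat history")
vault.store(b"my design documents")
print(tee_compute(vault.export_for_tee(), enclave_key))
```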

[00:20:18] That way, you can really have more guaranteed data security and user privacy, and then you can claim ownership over these things, right? Otherwise, ownership doesn't make sense. I was just thinking that ownership is more important than controlling the revenue. Why do I not put my show on YouTube, but I do put it on Spotify and Apple? Because if I put it on YouTube, then Google takes ownership.

[00:20:47] And I'm like, I don't like that. So I think the education is important, and I agree. So, say you are able to break in, or somehow break their model, right? What would it look like? What would the world of decentralized AI look like, and how would that help promote innovation around the world?

[00:21:18] Yeah, I think the simplest way to understand the outcome is that developers are able to make money. That seems like a very funny thing to say. But if you look at today, the way developers make money is that they work for big companies. They get shares of Meta, Google, or OpenAI and wait for the upside. And that's where my PhD students go after they graduate, as well.

[00:21:47] Because OpenAI was easily giving out million-dollar compensation packages for top-tier PhD graduates. Think about how crazy that is. That happened just last year; I don't know about this year. But even with such a bad job market, that still happens at these top AI companies. So, where was I?

[00:22:14] I was trying to say, basically, sorry, I got stuck in my head. What was the question again? What innovation, what new areas of innovation become possible if you break down their control of this market? Yeah, basically I was talking about the developers, who are the central piece of this whole puzzle, right?

[00:22:40] They have the entire recipe and the know-how to build these models, to turn user data into valuable models, and then monetize the models. But today, the path for these AI developers to shine is very limited: they work for big companies.

[00:22:56] So I think the big opportunity for everyone here is to think about a different platform where these developers can have much more freedom in monetizing their know-how, their skill sets, and the data and assets made using those skill sets, right? A good example to think about is the platform called Hugging Face.

[00:23:22] Not sure if everyone in the audience knows it, but it's an entirely open-source platform where people can upload public data sets and open-source models. Everyone can come and browse it like a marketplace. They can download any data set or model, and they can talk to the developers; there's a forum, there are discussions.

[00:23:46] It's like a one-stop shop for people to get the materials and building blocks for their AI development process. And it's entirely free these days.

[00:23:56] So we're thinking: what if we could turn Hugging Face into a proprietary sort of setup, where the people who contribute become the owners of those assets and are able to make money automatically from the people downloading them, licensing their data sets and using their models pay-as-you-go, for example.

[00:24:22] This seems like a very natural thing, because you are basically creating a market out of a non-market. Hugging Face is a non-market because there's no value associated with those assets; everything is public open source. But there must be a market version of it, where people upload, say, some of my emails, which are a valuable business resource.

[00:24:47] If I would love to monetize them, then I want a place to do that where people can price them. And there's a longer story about whether we should do data set pricing. I don't think so, right? I think data pricing is a hard problem, because you can't just say each of my messages is worth $100 or $1,000, right?

[00:25:07] But I do think that if you track how the data was used in a model, and then track how the model was used in an application that makes revenue, and you can automatically attribute that revenue back to the data set owners, that's a meaningful way to price the data set. Meaning: how much my data set is worth depends on how much the models built on my data sets make, right? So that's the revenue-sharing mechanism between all these assets.
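
In its simplest form, that attribution reduces to a pro-rata split of downstream revenue. Here is a toy sketch: the split fractions and contribution weights are assumed inputs (deriving them fairly from provenance is the hard research problem), and on a platform like Sahara this logic would presumably live in a smart contract rather than in off-chain Python.

```python
def share_revenue(app_revenue: float,
                  model_split: dict[str, float],
                  data_split: dict[str, float]) -> dict[str, float]:
    """Split an application's revenue down the provenance chain.

    model_split: fraction of revenue owed to each model-level party
                 (developer, compute provider, ...), summing to <= 1,
                 with the remainder flowing to the data contributors.
    data_split:  relative contribution weight of each dataset owner;
                 in practice derived from provenance records, assumed here.
    """
    payouts = {who: app_revenue * frac for who, frac in model_split.items()}
    data_pool = app_revenue * (1 - sum(model_split.values()))
    total_weight = sum(data_split.values())
    for who, w in data_split.items():
        payouts[who] = payouts.get(who, 0.0) + data_pool * w / total_weight
    return payouts

# $1,000 of app revenue: 40% to the model builder, 10% to compute,
# and the remaining 50% split across data contributors by weight.
print(share_revenue(
    1000.0,
    model_split={"model_dev": 0.40, "gpu_provider": 0.10},
    data_split={"alice": 3.0, "bob": 1.0, "carol": 1.0},
))
# -> model_dev 400, gpu_provider 100, alice 300, bob 100, carol 100
```

Executed by a smart contract, a split like this is also what removes the dispute-and-mediation overhead discussed later: the payout rule is public and anyone can re-run it against the ledger.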

[00:25:37] So we want to make that happen. Eventually, the goal is just to give model developers, data contributors, and app builders a one-stop shop to publish their proprietary assets and make money. Yeah. Interesting. Let's see. I just had a thought about my podcast.

[00:26:04] I'm like, how do I really value the conversations? If I talk to you before you go live or whatever, and then you go and make a billion dollars or whatever, how much of that was attributable to my podcast? Maybe $100,000? I mean, we'll see. You know, only 1%. But pricing is a hard challenge, right? And I'll leave it there; we'll talk again another time on pricing.

[00:26:32] I'd like to have that be a conversation. But I want to talk about the entirety of the AI lifecycle, right? What does it mean to have transparency throughout the lifecycle, and how does it enable the unlocking of monetization opportunities? Yeah.

[00:26:56] I mean, today, I think taking DeepSeek as an example is perfect, because it gives you the transparency, right? It tells you: what data sets were used in training this DeepSeek model? What is the model architecture? How large is the model, exactly? What are all the detailed specifications? And how do we train it; do we use this algorithm or the other one?

[00:27:27] And then, what other tricks did we use to make these things look better, right? All of this is what we call the recipe. And for the majority of models out there, like the GPTs and Llama, we don't know this recipe. So we don't know what the contributing factors to the success of the model are.

[00:27:52] So when we say we're putting this whole AI lifecycle on the blockchain and creating a sort of provenance that is auditable, that no one can modify once it was created, that everyone can audit, this is the thing we're trying to achieve, right? Think about another model like DeepSeek getting created, but created on our platform, so we know what data sets were used.

[00:28:21] And we even know, within those data sets, who contributed each data point, right? Who provided this tweet, this Reddit post, this Wikipedia page in the data sets. And we have all of this information stored on-chain, so we can use it to power the revenue sharing.

[00:28:43] I've mentioned this many times, but it's super important, because if you don't have this automatic revenue sharing executed by a smart contract on the blockchain itself, then there's always a problem of fairness, fairness and trust, basically. Can you trust it? Say you're supposed to give me $2 and I receive $1.50. Is something wrong there?

[00:29:12] The typical way is that you need to dispute it, then you need to find someone to do the auditing and give you results, and if there's a problem, you do mediation, right? Now, all of these processes wouldn't be a hassle anymore, because everything is executed by the smart contract. You can go back and audit it yourself.

[00:29:34] So we really want to create these autonomous systems for revenue sharing, based on provenance recorded on the blockchain. Interesting. It's an exciting area. I'm looking forward to seeing how it grows and develops over time, right? Yeah. So I want to thank you very much for speaking with me today. I think I learned a lot. It was great. You know, I have one last question.

[00:30:04] How can people find out more information about Sahara AI? How can they become a developer with you? How can they use your new platform and protocol? Yeah. The best place to start is our website, SaharaLabs.ai. We'll share the link with you as well. There you can find a bunch of channels to connect with us.

[00:30:33] And I think there are two notable ones. One is the Discord channel. We have almost 800,000 users right now, so many that we're trying to do a server upgrade these couple of days to make room for more people. And then our Twitter account is definitely another great place to look for the latest updates. We make announcements there about all of our product launches and about applications to our waitlist,

[00:31:01] turning people on the waitlist into users of our different products. That's happening these days: our data service platform has something like 100,000 waitlisted users, and we were able to get to around 80,000 daily active users in the past week. So we're very excited to get to know more of you and to work with more of you

[00:31:26] to create this whole user-owned AI era together, using our platform. Yeah. Awesome. Awesome. Thank you very much for your time today. Thank you, Jamil.
