Rowan Stone, Chief Executive Officer at Sapien, previously led Business and Operations at Base and was Director of Onchain BD at Coinbase, driving growth for products like cbETH and USDC. He joined Coinbase through the acquisition of Totle, where he was COO. Prior to Totle, Rowan co-founded Horizen Labs and Launch Code Capital, bringing over a decade of experience from energy and engineering sectors into the crypto space.
[00:00:03] Hello everybody and welcome to the Crypto Hipster Podcast. This is your host, Jamil Hasan, the Crypto Hipster, where I interview founders, entrepreneurs, executives, thought leaders, amazing people all around the world of crypto and blockchain globally. And I have another amazing guest for you today. I'm looking forward to this conversation. I'm going to keep saying his first name wrong. I got to say it right because I want to say Rolling Stone, but it's not. It's Rowan Stone. He's the founder and CEO of Sapien. Rowan, welcome to the show.
[00:00:34] Thanks for having me. And don't worry about the name. Literally nobody outside of Scotland will ever pronounce it correctly. So you're good. Perfect. Awesome. Awesome. Well, pleasure to have you here today and I look forward to speaking with you. I'll kick things off with the first question I ask everybody: what is your background, and is it a logical background for what you're doing now?
[00:00:58] It's a great question. Yeah. So my background: I'm from Scotland, which I may have alluded to already. And really in Scotland, there's only one industry and that is energy. We have finance and whiskey and other things, but when you're young and you're looking to make some money, that's typically where you end up. And so I spent the first half of my career, just shy of a decade, in the energy industry, the oil and gas industry. Absolutely hated it. And very quickly took an escape chute.
[00:01:26] A friend of mine told me about magical internet beans. And so ever since, for pretty much another decade, I have been building a variety of different companies in and around the on-chain world. And I've recently migrated that thought process and that effort from, I guess, coordinating capital to now coordinating data, specifically for AI.
[00:01:50] And the context here, I can go into more detail in my background if it's interesting, but the context here specifically for Sapien is that AI essentially is a child and it needs as much information as physically possible in order to become smarter, more sophisticated and more useful for all of us in the world.
[00:02:08] And so Sapien exists to essentially match up enterprise businesses that are typically building vertically specialized AI models, things like the model to drive an autonomous vehicle or a model to help a kid with their math homework, these sort of very specialized things. And we match those enterprise businesses that need data with essentially everyone everywhere. And our thesis here is super simple.
[00:02:36] All of us, regardless of our background, regardless of what we do, whether we are a VC or a podcast host, a different type of creator, a lawyer, doctor, engineer, we all have information, data in our heads that's super valuable to AI. So we're building a company to help match that up, help provide the framework, the transfer of the knowledge and ultimately make AI smarter and more useful for all of us. That sounds good in a nutshell.
[00:03:07] So I want to find more information about how you're changing the way AI models are trained, really. So we actually don't get involved in the training piece, but the big unlock here is throwing humans, really industrial-scale humans, into the data. And so the unlock was actually discovered by OpenAI.
[00:03:31] The reason that ChatGPT was leaps and bounds above the Meta and the Google models at the time when it first released is simply because they threw huge numbers of people at the data set. And so they were able to hone and really get the data to the point where it was properly ground truth and of very high quality, deduplicated, just actual useful knowledge rather than just blindly scraping everything they could from the Internet.
[00:03:59] And so adding humans into that kind of data loop at the very start and then continuing to use humans to fine tune the data sets and make sure the models are doing training based upon actual fact and useful things and not nonsense, because this really is a garbage in garbage out type process, was the major unlock. And so we're following that exact hypothesis. We are looking to connect large numbers of users everywhere in the world.
[00:04:29] And that's really important. The traditional way this has been done, ever since OpenAI pioneered it, is to use a hub and spoke model whereby you have large warehouses, call center type things full of people, typically in the developing world, Bangladesh, the Philippines, East Africa, places like this. And they are paid to annotate or structure or sometimes even provide net new data.
[00:04:57] And that's kind of typically what the market's known as data labeling. And so that was the way it was done previously. The way that we're doing it, rather than having this centralized hub and spoke model, we're creating essentially a gig work platform that everybody can participate in. And we're allowing people to build reputation over time in the system. They need to put a little bit of skin in the game in terms of some money alongside putting their reputation on the line when they do work.
[00:05:23] But by doing so, they have access to work, potentially sophisticated, high-paying work, that they otherwise wouldn't have. And importantly, because we do it in this open, distributed way, and because we have QC done by peers, we're much more efficient, which means we can process a lot more data. If you think about the way Sapien runs today, for example, we have a small office facility with roughly 70 people that are doing quality control work. We're doing the things that don't scale.
[00:05:52] The plan is to migrate very, very soon, actually, like within weeks, not months, I'm not going to give too much away, to the open model, which means we're creating a token. And that token will become the incentive structure. But the key for the conversation is that it moves away from a very small bottlenecked QC function into being able to allow any of our 700,000 plus users to participate. And so throughput goes from roughly 70 people to 700,000.
[00:06:21] But really importantly, for the people that are buying or kind of contracting for the data is that it's not going to have the same biases that it would have if you sourced all your data from one geographic location, or from one subset of people in the world. And so the summary there is, it's a different take on how to source and structure data, different model.
[00:06:45] And we're essentially taking learnings from a decade of building in the on-chain world, and being able to coordinate capital between peers without middlemen, and applying that to a data problem, applying it to AI and allowing the companies to source, in an incentive-aligned way, whatever they need to build their model. Okay, so let me see if I understand this correctly. You believe that everybody has data in their head, that's true.
[00:07:13] You have 700,000 users. Yeah. And your users are from the gig economy. Not all, but what we're trying to do here is buck a trend. The trend, particularly in the media, is AI equals bad, AI is taking all of our jobs. And yes, some jobs may become no longer required because we can automate them. However, AI doesn't work without us. Again, going back to the analogy, it's a child.
[00:07:43] Even now, it's a super sophisticated child. But if we don't provide it with the right knowledge, the right data, the right information, it's not going to be useful for us in the future. And so humans are essential to the learning process, essential to the data capture, and essential to the training and development of an AI model.
[00:08:03] And so rather than kind of the media pushing the standard narrative of AI is taking our jobs, we are looking to actively employ as many people as we possibly can to provide these companies with the data, the knowledge they need to make these models useful for everybody. And right now, that number is about 710,000 people. But it's increasing by 5,000 to 10,000 people per day.
[00:08:28] And once we move into a full production environment, I can talk a little bit about kind of the difference between where we are now and where we'll be in about two to three months' time. It's going to be able to scale much, much faster. Yeah, I do want to know that. I do want to know your vision and your mission and how you do it, how you include the human element instead of just scraping data.
[00:08:53] But I'm interested in your view and your long-term vision too. The vision here is super simple. It's a world in which everyone can participate and earn a meaningful income from providing their knowledge to the companies that are building AI models. And so essentially elevating this is a new type of gig work economy whereby you don't need a car for Uber. You don't need an apartment for Airbnb.
[00:09:20] You need your time, your knowledge, your understanding of the world around you. And you're able to monetize that through a clear, transparent, open, and permissionless system where you just rock up, prove your worth, and dig in. Awesome. Awesome. So, okay. I would think that the data set that you have, being based on people who are interested in earning, you know, there's a bias there.
[00:10:15] You're not capturing everybody; some people want to go work for somebody at a nine-to-five job, and that's all they want to do, right? You're capturing a selective group of people who want to be entrepreneurs and founders and gig workers, right? So how do you overcome the biases? Yeah, that's a great question.
[00:10:15] And so maybe it's worth just talking a little bit about the types of data that we typically have demand for, because that really steers the work that's available and therefore kind of guides us in terms of which types of users we're actively looking to attract and looking to provide kind of an income for. And so the two main types of data that we are focused on, on one hand, it's 3D and 4D data.
[00:10:42] So think of this as the type of data that's generated by autonomous vehicles. And so whether it's a Tesla or an Amazon Zoox taxi or any other type of car that's slowly adding this autonomous vehicle functionality to the normal driving experience, all of these vehicles are doing millions of miles every single year, and they're constantly encountering weird situations because the world is random.
[00:11:07] And so you can't train for random, but you can consistently steer them in the right direction and help them understand, A, what on earth are they looking at through the LiDAR data and different sensor information they capture? And B, how should they react in any particular scenario? And so a big chunk of the work that we do, and we work with some of the largest automotive manufacturers in the world,
[00:11:29] is to provide them with 3D and 4D annotation so they can better train their model and ultimately keep drivers, or in the case of robo taxis, passengers, safer when they're working. And then the other big demand area that we have is more on the collection side. And so it's not just annotating existing data sets, where a manufacturer brings us data, we annotate, and we send it back.
[00:11:55] We also source data specifically from our community of users. And so this could be something as simple as an image of a piece of paper with some handwriting on it. Perhaps someone is building a model to recognize human handwriting and transcribe that for note-taking, things like this. Or it could be images around the house. It could even be LiDAR data around the house to help recognize things in three dimensions,
[00:12:21] all the way through to audio speech recognition to help the Siri and kind of Google Voice type things better understand weird accents like mine from the north of Scotland. And the piece that's really interesting to us and the piece that is growing massively in demand is what we call chain of thought, chain of thought reasoning. And so it's no use giving a model an answer to a problem.
[00:12:47] Because yes, now it knows the answer, but how is it going to tell someone else how to get to the answer? And so a big part of the data collection that we do is not just 5 plus 5 equals 10, it's explaining why is 5 plus 5 10. Or in the case of medical screening, which we've had a couple of pretty large contracts recently, it's not just saying this particular area in a radiography image is suspicious and should be checked out.
[00:13:17] It's the chain of thought as to why the doctor thinks that particular area is suspicious, so the model can then learn from it and properly diagnose things going forward. And so just again to summarize, a big chunk of our work is for autonomous vehicles, 3D, 4D data, and increasingly for robotics, which is exactly the same data sets, just from a different kind of modality. So rather than vehicles, it's now humanoid robots or manufacturing robots that are capturing the information
[00:13:46] and need the human input, or it's capturing knowledge, understanding, context from humans, even preference in some cases, survey type data. And so those two things really drive the work that's on the platform, and therefore drive what's going to be attractive essentially to what type of people. Obviously if we have medical work, then we're looking for doctors. If we have something in the creative space, then guess what, we're looking for them.
[00:14:13] And the idea is that over time we will have a variety of different tasks available on game.sapien.io, and people can jump in, find something that suits their knowledge, their understanding, or even just the time that they have to spend, and earn some money by providing that information. So it's not just limited to the gig economy. You're actually also looking for experts,
[00:14:39] people who are experienced in fields like medicine, to augment your 3D and 4D data with their experience and wisdom. Absolutely, yeah, absolutely. We're increasingly seeing expertise as being hugely in demand. And it makes complete sense. Like we're trying to train AI models. They're going to get everything they can from the open internet, which has already been absorbed many times over.
[00:15:06] They can also just reach out to the open internet for context whenever they need it. And then they're going to encounter problems where they just don't quite know the right way to answer. And so the way to teach them is to find the right experts that have that subject matter expertise, and lean that into the data set so that model now can properly be useful for millions of other people that might have similar questions. I'm going to go back to your vision then.
[00:15:32] Right now I have a rare condition, right? It affects one in two million. There are a few experts in this area who are doctors, you know, interventional radiologists, oncologists, who have expertise in a certain area. How do you think their expertise, not my doctors specifically, but doctors with specialized expertise in particular,
[00:16:02] once they come onto your platform and are augmented, what do you think is possible in the next evolution of, you know, medicine and medical devices? And is that because of your platform? I'm not going to say because of our platform, because that's a very tall statement to make. However, we are actively trying to play a meaningful part in making this future possible. And so that's a clear part of why we exist,
[00:16:31] but it's definitely not a sapien will do X, Y, and Z. We are just a small part in a much bigger machine. However, I'm a huge sci-fi nerd. And the way I see things unfolding is pretty exciting from my perspective. And so now, particularly in the field of medical and longevity, health span is kind of one of the words that's used very frequently. If you think about the way that doctors typically see a patient, treat a patient,
[00:17:00] this is usually done on a symptoms basis. You rock up, you've got something wrong with you. They look at an isolated moment in time, isolated symptoms. If you're lucky, they might have your medical history for the past couple of years, so they've got a bit more context, and they'll come up with a plan to sort out this one particular thing. And in more cases than not, they'll just treat your symptoms. That's just how kind of the current medicine world works. And what AI could unlock, and I hope does unlock,
[00:17:30] is just a much greater context window, where a model is able to ingest huge amounts of data, essentially your entire medical history and your entire, perhaps, biometrics, because more and more of us are wearing things like this or things like this that are capturing tons of information. And so now the context window is not a couple of small paper folders on the doctor's desk. It's instead 30, 40, 50 years worth of biometrics,
[00:17:58] 50 years worth of perhaps dietary information, blood pressure, and all these different things, as well as your detailed medical history. And so when you're asking a doctor that's augmented by a system with this much context, you're just going to get a much more informed answer. And so being able to spot things early, being able to prevent degenerative disease before it becomes a thing, will become something that's much more normal. It won't be this thing where you need to go and pay hundreds of thousands of euros per year
[00:18:28] or hundreds of thousands of dollars per year to see some of the best doctors, because everybody will have that power sitting on their desk as they're speaking to you. And then I think opening the context window makes a ton of sense when you look across, rather than just one person, all people, or as many people as possible. And I think that's where pattern recognition and the ability to create new medicine, all types of new stuff, will become frequent. And I live for it. That's partly why we do what we're doing,
[00:18:58] but definitely can't be sitting here and saying, hey, Sapien's playing a big role here. We do do little bits of medical work, and we're proud of the work that we do for all the different industries that we work in. But ultimately, AI is advancing very, very quickly. And there are companies that are specialized in exactly this one part of the puzzle. We are not here to focus on one particular data type. Instead, we're here to provide a framework or a system that really enables knowledge transfer
from humans to machines, even though I fully realize that sounds a bit Black Mirror and dystopian. But that's the framework that's needed in order to get us to the point where these models can truly be useful in ways that they're not really capable of today. Sounds exciting to me. Sounds really good. Good. So, yeah. I think it's – I see blue ocean for opportunity.
[00:19:55] So, you know, I want to address, you know, the data, right? Not your data in particular, but the data in general. You know, what are some of the drawbacks of relying on recursively generated or synthetic data? And what should we use instead? We should use human insights, right? But what are the drawbacks of that synthetic data?
[00:20:22] So typically, you'll use synthetic data when you want to fast track a process. You want to teach a model quickly, and you don't have enough data. And so you use a base data set, you generate a bunch more context, and then you train on the back of that, and you essentially run simulations, right? And so it's a great way to fast track the initial kind of scoping out and the initial creation. But there's always going to be a need for this fine-tuning piece because you're always going to have potential for hallucination,
[00:20:52] potential for false – I mean, if we go back to, like, medical terms, false positives and false negatives, where it's not using ground truth information. It's using something that it's essentially made up to continue learning from. And so if we think about autonomous vehicles, for example, Waymo is a company that we do not work with, but Waymo is a great example here because they've been training specific cars – I think they're Jaguar F-Paces or something like this,
[00:21:19] the white ones that you may see in San Francisco – to drive around San Francisco. And they are gathering tons of context, and they've become pretty good at driving safely, taking passengers, not crashing, not running people over, not doing crazy stuff, just safely navigating the city. Now, the problem comes where you pull the car out of San Francisco and you say, hey, let's bring it to San Diego or let's bring it to Los Angeles or let's bring it to Vegas or whatever.
[00:21:48] Now it's in a completely different environment, and it's been trained on data that looks nothing like what it's now trying to do. And so this is kind of similar to when you're using a bunch of synthetic data. It's just not always relevant to what you actually need. And what you actually need is the data that's specific for the task that you're trying to train it on so that it can actually do those things. Now, there is a world in which it can be smart enough to learn from one data set and then extrapolate out from there.
[00:22:15] But what Waymo found, unfortunately, is that even after years of training in SF, the cars can't drive in another state. It just doesn't work. And so the downsides here, there are many. I am not an engineer, and so I'm not going to go into them in enough detail for anyone that is an engineer. But the bottom line is the higher the quality of the data set that you use, the better the outcome you will have. And so if you're willing to take a little bit of a risk to shortcut a process
[00:22:42] and get a model off the ground at a first pass using a ton of synthetic data, it's probably the fastest way to do it. But you'll always need to then go back and figure out how to hone that data set and get it to the point where you're able to train for exactly the outcome you're looking for. I have a car example. It might come in handy, you know.
[00:23:07] But I want to talk about, you know, you rely upon a global contributor network instead, right? So what are the benefits and the drawbacks of that network compared to the current models that exist? Let's just continue with a car example. Waymo can't drive in San Diego because all of its data was created in San Francisco. Cool. Let's train using data from every country in the world. Well, guess what?
[00:23:34] We're going to have to drive in every country in the world because now we have local relevant context from people that actually know the area that we're trying to train for. And so that really is the unlock here. It's not having a bias in your data set. That bias could be geographical, San Francisco, or it could be age and sex related. It might be that you've targeted young gig workers in the Philippines that want to earn a few bucks an hour.
[00:24:01] And so now all of a sudden your model thinks like a 20-year-old male Filipino. And like, maybe that works. Maybe your model is meant to think like that. But if it's a model for a broader audience, the chances are that's not a desirable outcome. And so having the ability to lean into a globally distributed group of people to get lots of different types of context or expertise or a nuanced understanding of the world around them is super powerful for the enterprises building these models.
[00:24:29] And it's really the only reason that we exist. There's demand for this type of data. And we're building a new system, a new way of being able to link the supply up with that demand. Got it. So my car example is a lot simpler. For self-driving cars, like the other day I was driving, I was at a four-way stop. Okay. So I was the fourth car to be at the four-way stop. The guy on my right goes straight. The guy on my left goes straight.
[00:24:58] The person who's across from me goes straight. And my turn is next. I go to turn left. But the person behind the person ahead of me, straight across, doesn't stop at the stop sign and goes through. I honk at them. They give me the finger. I'm like, how do you account for human stupidity in these models? Because that probably wouldn't have been calculated by the AI bot. You're absolutely right. And it's back to randomness.
[00:25:27] We get asked a lot, what happens when all of the models become so smart that you're not needed? And maybe there's a future where that exists. But my current gut is that randomness is on our side, in that random things happen all the time. And being able to have a good, safe answer for every single random thing that may happen on the entire planet, it's pretty unlikely anytime soon.
[00:25:54] And so consistently, we will need human opinion, human input, human insight, human context to steer that decision whenever something new that's random happens. Even if we manage to get over the initial kind of let's transfer all of human knowledge to AI, which we are a very, very long way away from even scratching the surface of today. Got it. Thank you. So I want to shift gears a little bit.
[00:26:24] You know, you have an ambassador program, right? I want to find out how that works and why it's important for people to take part and how they can. Yeah, it's a key part of our go-to-market. We are a young company. We work with 27 pretty large enterprise businesses, including some of the largest in the world, which I guess is testament to the demand. But ultimately, we're a startup.
[00:26:50] And so we're getting over the cold start problem of bring enough demand, bring enough supply, marry them together. And away we go. We have a marketplace now. But a key part of the building supply, a key part of building a network of people is to have people jump in, enjoy using the product, earn some money and tell their friends about it. And so we have a referral program that's ongoing.
[00:27:13] And that enables anyone who refers their friends, who then come in and start providing knowledge, information, data, and getting paid for it. We provide a little rev share. And so if you refer your friend, you can earn up to 5% of your friend's income over the next year. And so that's the way that we incentivize some viral growth. But beyond that, if you want to be really going out and referring a ton of people, we have essentially an affiliate style program. We call it the Ambassador Program.
[00:27:42] And we will support these folks in whatever way they need to go out into their communities and help educate people that Sapien exists. And it's a cool opportunity to earn some money on the side, depending on what tasks we have live at any given time. And yeah, the way you participate typically is jumping into the Discord. And there's a ton of instructions in there about exactly how to become an ambassador and what the incentive structures look like.
[00:28:11] But even aside from that, if it's not something people are interested in, everybody who signs up has a referral code. And everybody can use that referral code to refer their friends and start earning some extra income on top of whatever it is they earn themselves. So kind of two paths, if you like. And what are some examples, some recent tasks that you've seen, you've offered? Oh, this is actually a really good question. I've talked about autonomous vehicles. I've talked about medical imaging.
[00:28:40] I've talked about some pretty sophisticated data tasks. If you log into our platform right now, you're not going to see those. And the reason you're not going to see those is that we need to build a few key parts of the system to bring the tasks that are currently in a permissioned environment out into the open so that everybody can participate. Right. So think about us as kind of being in like beta or pre-production today.
[00:29:32] We've got six or seven tasks live right now. There's things like, I'm not going to name the name, but a car history company asked us to help them recognize vehicle identification numbers in images. And then that morphed into helping them teach a model what they're looking at in terms of the direction the car is facing. And so that's one of the most popular tasks in our front end today. But I think more interesting is the stuff that's happening in private and that will soon be migrating into public.
[00:30:02] And just to be clear, we have a reputation system kind of in very early alpha. That's a key part of the kind of unlock. When that moves into production, we can start bringing things over. And the next part is a kind of onboarding wizard, if you like, like a qualification flow. And that pretty much just means that rather than our operations team checking to make sure this is the right person for the right task, we automate that. Everybody builds reputation. They can then onboard themselves to any of these tasks.
[00:30:31] Some of them will be anonymized so that the customer information is private. A lot of these companies don't want their peers to know kind of where they're sourcing the data that they use for their models. But cool examples, we've got the autonomous vehicle stuff. That's something that's very close to my heart, something that I'm a big fan of. We've got a bunch of audio speech recognition tasks ongoing. We've got, I actually mentioned the handwriting one already. Medical imaging has been something that we've done quite a few times.
[00:31:01] We have survey information. We have translation works. There's a whole bunch. There's a massive variety of different types of tasks. And really this fluctuates on a weekly basis. So if you rock up and you log in and you don't see a task that's immediately appealing, check back in a few days. The chances are there'll be something new. And similarly, if you do a task today and you enjoy it and you come back tomorrow, it may not be there.
[00:31:28] Because there's only a certain amount of data required for each of these things. And if it's full, it's full. The data's been delivered. We move on to the next thing. Sounds exciting to me. Sounds really good. So I want to thank you very much for speaking to me today. I'm excited for what you guys are up to. And I look forward to seeing your roadmap play out moving forward. I have one last question. It's how can people find out more information about you, about Sapien?
[00:31:58] How can they participate? How can they do that? I don't think anybody wants to find out more about me. But Sapien is @playSapien on X. Or they can simply go to game.sapien.io and click earn now if they want to sign up and start checking the tasks themselves. We also have Discord and things like that, Telegram linked in there.
[00:32:20] And if for any bizarre reason someone does want to keep an eye on what I'm doing personally and where we're kind of moving as a business, then @RowanRK6 on X. And I really appreciate your time. Thanks for having me. You're welcome. Thank you very much for your time today.


