DUE TO SOME HEADACHES IN THE PAST, PLEASE NOTE LEGAL CONDITIONS:
David Yakobovitch owns the copyright in and to all content in and transcripts of The HumAIn Podcast, with all rights reserved, as well as his right of publicity.
WHAT YOU’RE WELCOME TO DO: You are welcome to share the below transcript (up to 500 words but not more) in media articles (e.g., The New York Times, LA Times, The Guardian), on your personal website, in a non-commercial article or blog post (e.g., Medium), and/or on a personal social media account for non-commercial purposes, provided that you include attribution to “The HumAIn Podcast” and link back to the humainpodcast.com URL. For the sake of clarity, media outlets with advertising models are permitted to use excerpts from the transcript per the above.
WHAT IS NOT ALLOWED: No one is authorized to copy any portion of the podcast content or use David Yakobovitch’s name, image or likeness for any commercial purpose or use, including without limitation inclusion in any books, e-books, book summaries or synopses, or on a commercial website or social media site (e.g., Facebook, Twitter, Instagram, etc.) that offers or promotes your or another’s products or services. For the sake of clarity, media outlets are permitted to use photos of David Yakobovitch from the media room on humainpodcast.com or (obviously) license photos of David Yakobovitch from Getty Images, etc.
You are listening to the HumAIn podcast. HumAIn is your first look at the startups and industry titans that are leading and disrupting artificial intelligence, data science, future of work and developer education. I am your host, David Yakobovitch, and you are listening to HumAIn. If you like this episode, remember to subscribe and leave a review. Now onto the show.
David Yakobovitch
Welcome back, listeners, to the HumAIn podcast, the leading show on applied AI, data science, and ML. Today we’re bringing to the show serial entrepreneur Steven Banerjee. I’ve had the pleasure to meet Steven in Silicon Valley, where we’ve been running our tech dinner series in San Francisco. I recently met him at the Press Club, at the Four Seasons, and also at our SingleStore office, where we had some lovely conversations talking about all things data.
Steven is a serial entrepreneur who previously was the founder of Mekonos and today is launching his latest venture, which is focused on natural language processing in biomedical research, explainable AI, and drug discovery. Steven, thanks so much for joining us on the show.
Steven Banerjee
Hey, thanks, David. Thanks for inviting me over.
David Yakobovitch
Well, I love all things data. And I love founders who are building, especially in the industry, and yourself as someone who is a repeat founder. For the audience here, we also got to meet each other through the On Deck community. On Deck, for many of you listeners, is the new YC of the 21st century. So if you’ve never checked out On Deck, be sure to check it out. And I know you’re also backed by On Deck, which is very exciting.
Steven Banerjee
Yep. That’s correct.
David Yakobovitch
Excellent. Well, hey, start off and let everyone know a little bit about your background. You’re built tried and true on the West Coast. So I’d like to hear about what you did and what you were scaling, and what’s led you to be a repeat founder.
Steven Banerjee
Sure. So, a little bit of background here. I am a mechanical engineer by training. I started my graduate research in semiconductor technologies with applications in biotech more than a decade ago, in the early 2010s. I was a Doctoral Fellow at IBM labs here in San Jose, California. Then I also ended up writing some successful federal grants with a gene sequencing pioneer at Stanford, Ron Davis, before I ended up going to UC Berkeley for grad school research, and then I became a visiting researcher.
So, during the course of that, I eventually founded my first company, Mekonos, in 2017 to pioneer a cell and gene therapy delivery platform, using the semiconductor technology that I was previously working on. And during my time founding and running the company, I came to realize the tremendous pain that exists in this industry in terms of acquiring and accessing biomedical knowledge, evidence, and insights that are actually hidden and buried within all these very disjointed, siloed, and remarkably messy biomedical data sources. A lot of this data exists without context, and the sources do not share a common language.
So, in late 2020, I ended up starting this new company called NextNet, in order to leverage the recent breakthroughs back then in natural language-based AI, with a vision of bringing an explainable version of this AI to biotech and pharma. So, I’ve been working on this since late 2020. And we are at a pretty exciting juncture today.
David Yakobovitch
It’s always very exciting when founders become repeat founders. And I think what’s really exciting is that you spent a lot of time in the lab, working in deep tech, working in hardware and software. There must have been things that you discovered when you were building your first company, when you were doing all this research at Mekonos. And then you had some aha moments and decided, let’s do it again. What excited and encouraged you to be a repeat founder?
Steven Banerjee
That’s a great question, David. So, one of the key motivations for me personally is that biomedical R&D is a very broken system. And I’ll maybe spell out some facts here. According to studies, there are around 23,500 or more diseases known to mankind, out of which a little less than 5% actually have some form of cure or treatment. You’ve got hundreds of millions of people who actually suffer from, let’s say, genetic disorders, and unfortunately, a lot of that stems from the way biomedical R&D is conducted: very expensive and very risky.
An average cost of bringing a drug to market is around $2.6 billion. It takes around 10 to 15 years, from the earliest days of discovery to launching into the market. And unfortunately, more than 96% of all drug R&D actually fails. This is a really bad social model. This creates an enormous burden on our society and our healthcare spending as well. One of the reasons I started NextNet was that when I was running Mekonos, I kept on seeing that a lot of our customers had this tremendous pain point, where all these domain and subject matter experts, the scientists, are actually working with very little of the available biomedical evidence out there. And a lot of the time that actually leads to false discoveries.
And there’s ample evidence in scientific literature that false discoveries likely greatly outnumber true discoveries in preclinical research, and hence bad decisions are made. Whether you’re selecting new targets, or designing a new drug in terms of molecular design, or performing a clinical trial, and so forth, access to a very limited amount of biomedical knowledge can lead to ill-informed decision making.
And one of our core motivations was, let’s try to build a platform by leveraging the latest in language AI, so that we can enable scientists to query and access all that information, all the biomedical knowledge, the totality of biomedical evidence out there, so that they can understand the underlying molecular mechanisms of diseases and so forth. And we see that as critical to the development of effective and efficacious drugs.
So, the way we put it is, we’re building this digital infrastructure for biomedical R&D and decision making. And our goal really is to liberate the people involved in drug discovery and development, and give them access to all the knowledge and insights out there that they currently don’t have access to.
David Yakobovitch
Now, you’re speaking music to my ears, because when you talk about querying data, often today we see that with the leading data analysts and data scientists, most of the work is still done with queries, right? And these queries can be in different languages. Sometimes they’re low-level languages, like Rust and C++, which can compile to targets like WebAssembly (Wasm), or higher-level ones like Python or SQL. So querying is so important to get to the heart of the data and to understand all the analysis. I’d like to hear a little bit about the data that you’re seeing, this big data in biomedicine. I mean, what are the file types? What are some of the data that you and the team are seeing?
Steven Banerjee
Absolutely, David. So, that really cuts to the core of what we’re building, right? You have got all this different biomedical data that is very disjointed, disparate, and remarkably messy. You’ve got this variety of different modalities of data: you’ve got DNA and RNA sequences, protein functions, gene expression, biological pathways, disease databases, imaging, and scientific literature, just to give you some idea of the breadth of the data available. Every 10 seconds, there is one life science paper being published. And then you’ve got 10 million gigabytes of molecular data, like sequencing and expression-type data, available per scientist.
And think about this: our human body is a living, breathing big data system. You’ve got 37 trillion cells, packed full of data, with billions of chemical reactions per second. It’s a massive data system. And unfortunately, the way a lot of this data exists, it’s very non-uniform. It is non-uniform not only in terms of the nomenclature of the biological terms, where, for example, genes and proteins can have different aliases, synonyms, and IDs in different data sources, but also in that data schemas are highly inconsistent and not machine-readable. You rarely have controlled vocabularies or ontologies in all this highly unstructured data.
So for example, let’s say you want to query, “Hey, show me all the Type 2 Diabetes studies where a certain gene of interest is differentially expressed.” That could take weeks, if not months. And that is a big problem, because you have a lot of tools that are available, and a lot of these tools are open source. These are software and databases that are mostly command-line centric, and they’ve got very technical UIs.
And there are literally thousands of these tools. These tools were built piecemeal, and they were created for the specific needs of a certain academic research group, right? So they were built on grant money. And what happens is that a lot of these tools quickly become outdated because of a lack of maintenance on the back end, because the grant money runs out, and then the people that actually developed the tools leave the project.
And so there is this plethora of bioinformatics tools, software, and databases out there that are plagued with program bugs. They mostly lack documentation, or have very complicated documentation at best, and very technical UIs. And as an average scientist or an average person in this industry, you really need to have a fairly deep grasp or a sophisticated understanding of database schemas, SQL querying, statistical modeling, coding, and data science, and to master a myriad of R libraries, Python packages, and databases.
And the problem with that is that not only is all this biomedical data out there siloed, but the people working with this data are siloed as well. And so biomedical data without a context of collaboration, together with this kind of siloed thinking, actually adds very little value. Just because all this data is available doesn’t mean that it’s usable. That’s what we’re seeing, and that’s what we’re here to solve.
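As an illustration of the alias problem and the Type 2 Diabetes query described above, here is a minimal Python sketch. Everything in it is hypothetical: the gene names, the alias table, the `Study` record layout, and the fold-change and p-value cutoffs are assumptions made for the example, not NextNet’s schema or any real database.

```python
from dataclasses import dataclass

# Hypothetical alias table: canonical symbol -> names seen across sources.
ALIASES = {
    "GENE_A": {"GENE_A", "GA1", "gene-a protein"},
    "GENE_B": {"GENE_B", "GB7"},
}

# Reverse index so any alias (case-insensitive) maps to one canonical symbol.
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, names in ALIASES.items()
    for alias in names
}


@dataclass
class Study:
    study_id: str
    disease: str
    gene: str                # name as reported by the source, possibly an alias
    log2_fold_change: float
    adjusted_p: float


def normalize_gene(name):
    """Return the canonical symbol for a reported gene name, or None if unknown."""
    return ALIAS_TO_CANONICAL.get(name.strip().lower())


def differentially_expressed(studies, disease, gene, min_abs_lfc=1.0, max_p=0.05):
    """IDs of studies on `disease` where `gene` passes the (assumed) cutoffs."""
    target = normalize_gene(gene)
    if target is None:
        return []
    hits = []
    for s in studies:
        if s.disease.lower() != disease.lower():
            continue
        if normalize_gene(s.gene) != target:
            continue
        if abs(s.log2_fold_change) >= min_abs_lfc and s.adjusted_p <= max_p:
            hits.append(s.study_id)
    return hits


# Made-up records standing in for rows pulled from different sources.
STUDIES = [
    Study("S1", "Type 2 Diabetes", "GA1", 1.8, 0.01),
    Study("S2", "type 2 diabetes", "gene_a", -2.1, 0.002),
    Study("S3", "Type 2 Diabetes", "GB7", 2.5, 0.001),
    Study("S4", "Type 2 Diabetes", "GENE_A", 0.3, 0.40),
    Study("S5", "Asthma", "GA1", 3.0, 0.001),
]

hits = differentially_expressed(STUDIES, "Type 2 Diabetes", "GA1")
print(hits)
```

The point of the sketch is that the query only becomes a one-liner once names and disease labels have been normalized. Without the alias table, studies S1 and S2 would look like they concern different genes.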
David Yakobovitch
One of the bigger issues we’re seeing across every industry with data is the idea that more data is better. But what can you do with it? How can you work with it? How can you put it into one holistic, integrated system, where everything’s connected, so that the data analysts, data scientists, ML engineers, and bioinformaticians can actually understand how to parse that information? And it sounds like a lot of what you’re focusing on is around text: these research papers and these studies. And, as you mentioned, every 10 seconds, one’s coming out.
If you had to read a 30 to 100 page research paper every 10 seconds, that’s not possible. In addition, with the datasets that are being shared for reproducible analysis, the amount of data is growing exponentially. And so this begs the question: how do you create a robust system that’s using full text search, that’s using semantic analysis, that’s diving deep, so that ultimately research can be done better, faster, and cheaper? It sounds like NextNet is at the heart of that, helping to pioneer NLP for drug discovery and development.
Steven Banerjee
Absolutely. So, David, one thing to clarify: people often think that we’re just dealing with text. One of our core differentiators is that we are not only going after text, like scientific literature, patents, reviews, and so forth, but we’re also going after more molecular data sources, like disease databases, DNA sequencing, gene and protein expression databases, and so forth.
So, we’re extracting information from all of those. But to answer your question here, we have built probably the most sophisticated NLP stack, which allows us to analyze this vast corpus of disjointed and very multimodal data, connect the knowledge and concepts extracted from it, reason across all this interconnected landscape, and then provide scientists with research capabilities beyond human insight.
So the way to think of this in simpler terms is that Sapiens is basically teaching itself biology and chemistry by reading papers and textbooks, extracting knowledge from different gene and protein databases, contextually learning all that information, and then updating itself every day by reading hundreds to thousands of papers a day and millions of abstracts a month, and ingesting all these massive sequencing and expression databases, and so forth.
Now, an average scientist can read, I would say, around 400 papers a year at most, right? And just in the last 24 hours, we had around 10,000 papers published. So at the rate at which an average scientist can read all that information, it would potentially take the person 20 years to go through everything that was published just in the last 24 hours. What Sapiens does, and does really well, is extract knowledge, concepts, and facts, and connect all of those into this massive knowledge graph. So we have built probably one of the most sophisticated knowledge graphs at the heart of what we’re building, and then we are building applications on top of that.
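A quick back-of-the-envelope check of the reading-rate figures quoted above, taking the numbers from the conversation at face value (400 papers read per year, about 10,000 papers published per day, one paper every 10 seconds). The variable names are just for this sketch.

```python
PAPERS_PUBLISHED_PER_DAY = 10_000   # figure quoted in the conversation
PAPERS_READ_PER_YEAR = 400          # upper bound quoted for one scientist
SECONDS_PER_DAY = 24 * 60 * 60

# Years one scientist would need to read a single day's output.
years_to_read_one_day = PAPERS_PUBLISHED_PER_DAY / PAPERS_READ_PER_YEAR

# Daily output implied by "one paper every 10 seconds".
papers_per_day_at_one_per_10s = SECONDS_PER_DAY // 10

print(years_to_read_one_day)          # 25.0
print(papers_per_day_at_one_per_10s)  # 8640
```

At the quoted upper bound the backlog works out to about 25 years, in the same ballpark as the 20 years mentioned, and one paper every 10 seconds gives 8,640 papers per day, consistent with the rough 10,000 figure.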
So, we have discovered close to 100 million contextualized, machine-curated relationships. And our goal is, within the next 12 months, to reach the mark of a billion relationships between biological concepts, between facts, and between biological entities like cells, genes, proteins, diseases, pathways, metabolites, and so forth. And then on top of that, we have built applications such as search and discovery.
So, you can ask Sapiens abstract questions, like: What are the biomarkers for such and such disease? How does pathway X affect gene Y in some cell type C? What are the research trends for a certain area X? You can ask those kinds of questions, and you can begin to analyze the results. Sapiens basically surfaces results in the form of relationships between biological concepts and these entities. Then you can manipulate those connections in the network. And you can also upload your own data.
So let’s say you’re performing some biological experiment, and you’re generating images and RNA sequencing and so forth. We’re building a workflow which is completely codeless, zero code, where you can upload that internal experimental data and proprietary data onto the platform securely, and then transform all of that into this massive network. You can really begin to contextualize that information, manipulate those relationships, generate actual insights, generate some hypotheses, test and validate those hypotheses, share that across the enterprise within your team, and then make informed decisions, both scientific and business decisions. And that’s a very powerful thing.
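To make the idea of relationships between biological entities more concrete, here is a minimal Python sketch of a knowledge graph stored as (subject, relation, object, source) facts. It illustrates the general technique only, not how Sapiens is implemented. The class, the relation names, the entities, and the source labels are all hypothetical.

```python
class KnowledgeGraph:
    """Tiny in-memory store of (subject, relation, object, source) facts."""

    def __init__(self):
        self.facts = []

    def add(self, subject, relation, obj, source):
        # Keeping the source with each fact lets a user trace a relationship
        # back to the paper, database, or internal experiment it came from.
        self.facts.append((subject, relation, obj, source))

    def subjects(self, relation, obj):
        """All subjects linked to `obj` by `relation`, sorted and deduplicated."""
        return sorted({s for s, r, o, _ in self.facts if r == relation and o == obj})

    def neighbors(self, entity):
        """Every fact that mentions `entity` on either side."""
        return [f for f in self.facts if entity in (f[0], f[2])]


kg = KnowledgeGraph()
# Facts that might be extracted from public literature and databases (made up).
kg.add("GENE_A", "biomarker_for", "Disease X", "paper:0001")
kg.add("PROTEIN_B", "biomarker_for", "Disease X", "paper:0002")
kg.add("Pathway P", "regulates", "GENE_A", "database:pathways")
# A fact from a user's own uploaded experiment (also made up).
kg.add("GENE_A", "upregulated_in", "Cell type C", "internal:experiment-42")

# "What are the biomarkers for Disease X?"
biomarkers = kg.subjects("biomarker_for", "Disease X")
# Everything known about GENE_A, public and internal, in one place.
gene_a_context = kg.neighbors("GENE_A")

print(biomarkers)
for fact in gene_a_context:
    print(fact)
```

A linear scan over a list is fine for four facts. A graph with hundreds of millions of relationships would need indexed storage, which this sketch deliberately leaves out.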
David Yakobovitch
And from everything that you’re sharing, Steven, look, there’s been a variety of models built over the years on a variety of text corpora, going from millions to now trillions of parameters. We think of OpenAI, we think of GPT, we think of BERT, just a variety of models, and your team’s building state of the art. What does state-of-the-art NLP look like for you and the team at NextNet?
Steven Banerjee
I think a lot of the work that we’re doing could not have been done two or three years ago, because some of the advancements that have happened in the last two or three years have been really phenomenal. Just to give you some perspective here, less than a decade ago, in order to understand what a text is about, AI algorithms would only count how often certain words occurred. The issue with this approach was that it ignored the fact that words and their synonyms only mean something when they’re contextualized. But recent progress in NLP research has been accelerated by the adoption of self-supervised learning from very large-scale data and by the transformer model architecture.
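The limitation of pure word counting can be shown in a few lines of Python. This is a minimal bag-of-words sketch with two made-up sentences. It is not any specific historical system.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs, ignoring order entirely."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


sentence_a = "the drug inhibits the protein"
sentence_b = "the protein inhibits the drug"

counts_a = bag_of_words(sentence_a)
counts_b = bag_of_words(sentence_b)

# The two sentences say opposite things, yet their word counts are identical.
print(counts_a == counts_b)
print(counts_a)
```

Because the counts match exactly, a model that sees only these counts cannot tell which entity is acting on which. That is the context problem described above.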
So, the transformer is potentially one of the greatest breakthroughs that has happened in NLP recently. It’s basically a neural net architecture that was incorporated into NLP models by Google Brain researchers and came along in 2017 and 2018. Before transformers, the state-of-the-art models in NLP were basically LSTMs. Long short-term memory networks were the widely used architecture.
And those were based on recurrent neural nets. By definition, this kind of recurrent neural net architecture processes data sequentially, that is, one word or word piece at a time, in the order in which those words appear. But around 2017 and 2018, when transformers came into the picture, the transformer’s innovation was to make language processing parallelized, meaning that all the tokens are processed together, where a token can be a word, a character, or a subword.
So all the tokens in a given body of text are analyzed at the same time, at least within a window, rather than one word at a time in sequence. In order to support this kind of parallelization, transformers rely on an AI mechanism called attention, which, as the name suggests, is pretty much exactly that: it is how we get this kind of AI model to attend to the important semantic parts, as we do as humans. And so attention, as implemented, can enable a model to consider the relationship between words even if those words are far apart in a long sentence.
And then to determine which words and phrases in a passage are most important to pay attention to. I think that really allowed this kind of transformer architecture to learn the meaning of a word in relation to its context in a sentence. And that really solves a key problem, David, which was a problem with the previous LSTM-type models, this kind of recurrent neural network model.
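The attention mechanism described above can be sketched in plain Python as scaled dot-product attention. This is a simplified illustration. The three toy token vectors are made up, and the queries, keys, and values are the raw token vectors themselves, whereas a real transformer first applies learned projection matrices and uses multiple heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token at once, near or far.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted mix of all value vectors.
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        all_weights.append(weights)
        outputs.append(mixed)
    return outputs, all_weights


# Toy 2-dimensional "embeddings" for a 3-token sequence (made up).
# Tokens 0 and 2 are similar; token 1 sits between them and is different.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In the first row of weights, token 0 attends as strongly to token 2 as to itself, and more strongly than to its immediate neighbor, token 1. Distance in the sequence plays no role in the score, and each row is computed independently of the others, which is what makes the computation parallelizable.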
But of course, this kind of parallelization means that it’s vastly more computationally efficient than these recurrent neural nets. It can be trained on larger datasets and built with more parameters, but it all comes at a big price: it is incredibly resource intensive and very technically challenging. Very few companies or researchers actually build their own NLP model from scratch. It takes enormous amounts of computational resources and engineering to train models on these massive datasets with millions or billions of parameters.
And so, you may have heard of GPT-3 and BERT models and so forth, which require, I think, several thousand petaflop/s-days or something like that, which is one of the measures that they actually use for training, and then millions of dollars per training run. It’s very complex and very costly to build, and most companies simply don’t have the resources to build such large language models from scratch.
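For a sense of scale, a petaflop/s-day is the amount of computation performed by running at one petaflop per second (10^15 floating-point operations per second) for a full day. The short Python sketch below converts that unit into total operations. The value of 3,000 is only an assumed stand-in for the “several thousand” mentioned above, not a measured figure for any particular model.

```python
FLOP_PER_SECOND_AT_ONE_PETAFLOP = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def petaflop_s_days_to_flop(pfs_days):
    """Total floating-point operations in the given number of petaflop/s-days."""
    return pfs_days * FLOP_PER_SECOND_AT_ONE_PETAFLOP * SECONDS_PER_DAY


one_pfs_day = petaflop_s_days_to_flop(1)
# 3,000 is an assumed example value for "several thousand".
example_training_run = petaflop_s_days_to_flop(3000)

print(f"{one_pfs_day:.3e}")           # 8.640e+19
print(f"{example_training_run:.3e}")  # 2.592e+23
```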
So, virtually all advanced NLP in use today, irrespective of the industry or the setting, is based on a small handful of massive pre-trained language models. Some of those language models are, for example, BERT from Google, RoBERTa from Facebook, GPT from OpenAI, and so forth.
And we are actually leveraging some of these language generation and representation models. We have built our own proprietary NLP architecture on top of that, along with a lot of the tools in house, and that has really allowed us to automate knowledge extraction from all these very diverse, disparate data sources in the biomedical arena, which would just not have been possible two or three years ago.
David Yakobovitch
It’s super fascinating to see the journey that you’ve shared with our listeners, Steven, about how the entire NLP industry is continuing to evolve, and how this means a variety of new products are continuing to evolve.
And those products include access to parallel processing, access to cloud compute, and access to larger datasets. You were sharing at the onset here that you and your team have built what is known as Sapiens, which is what powers NextNet. Can you share a little bit more about what’s under the hood, now that this technology is becoming more mainstream? How is Sapiens leveraging that? And what are some of the ways Sapiens is going to be helping in this industry?
Steven Banerjee
So the thing that powers Sapiens is this massive knowledge graph that we have built at the core. And as I was saying before, we have discovered close to 100 million AI-derived, machine-curated relationships between biological concepts, entities, and so forth. I want to give you some perspective here. So, 100 million machine-curated relationships. Contrast that with one of the leading companies, a classic incumbent in the field. And I will probably not name names here.
I was reading one of their fact sheets: they have discovered close to 6 million biological concepts that they have curated over 20 years. We have reached close to 100 million relationships between concepts, facts, and so forth within a little over a year. And again, this is based upon some of the latest advancements in NLP. And we are fast approaching the mark of potentially close to a billion relationships within the next 12 months or so.
So we have built our automated platform, Sapiens, to help companies identify new targets, generate new hypotheses by understanding the underlying cause of a disease, and guide the research for molecular or therapeutic design to develop the most effective medicine to treat the disease. And potentially, as we move forward, we also see applications of Sapiens in clinical trials, for patient stratification: identifying patient subgroups to understand how individual patients will respond to personalized treatments, and so forth.
So our goal here with Sapiens is to really make biomedical data accessible and useful for scientific inquiry using this platform, so that your average person in industry, let’s say a wet lab or dry lab scientist, a VP of R&D or CSO, or a director of research, can ask and answer complex biological questions and better frame hypotheses to understand these very complex, multifactorial diseases. A lot of the insights that Sapiens is extracting from all these publicly available data sources are proprietary to the company. And then you can map and upload your own internal data, and begin to really contextualize all that information by uploading it onto Sapiens.
You can contextualize it with the insights surfaced by Sapiens, and then rapidly test and validate hypotheses, and hence reduce development time, cost, and failure rates. And one thing, David, I really want to make sure that our audience understands: NextNet is not a data broker or aggregator; we see ourselves as a digital infrastructure for biomedical R&D and decision making. That goes back to the original thing we were talking about, how bad decisions can lead to drug failures, because our domain and subject matter experts are working with very little of the available biomedical evidence. We’re trying to liberate them so that they can have access to the totality of biomedical evidence out there, right? And Sapiens actually does that.
So, we are rapidly growing the platform. We have just recently started limited trials with early adopters, which include academic research labs and some select enterprise clients. And over the course of this year, we will fine-tune a lot of our platform UI, the workflow, and so forth.
David Yakobovitch
Very exciting. And I know that at NextNet you’ve also been growing the team. You’re currently hiring, and you have, among other roles, a full stack engineer position open. So can you tell us more about growth, whether that means through fundraising or through the team, as NextNet is going through this next chapter?
Steven Banerjee
That’s very relevant, David. Thank you for asking. So, we are beta testing the platform. Basically, we have built a search and discovery feature, and we are beta testing it with select enterprise clients and some academic research labs. The goal for us during the course of this year is to work closely with some early adopters to refine the UI, the graph navigation, the knowledge collaboration tools, the workflow, and so forth.
We have also been raising a financing round so we’ve got half of that investment started and we are looking for additional investors to come and join us in this journey. And the goal for us is to really raise a larger round, sometime early next year, and then scale this platform to like, at least half a dozen, six to eight enterprise clients and, potentially a dozen academic research labs sometime next year.
And also to build out a leadership team and go, as they call it, elephant hunting: bring on board large pharma from our beta customers, continue to build our IP, leverage the key differentiators that we have, keep our pricing competitive, and then go on to raise a more significant Series A after that. From there, we want to expand our customer-facing teams, like support and sales engineering, to accommodate new customers, and partner with others, like instrumentation, life science, and R&D companies, that can offer Sapiens as a part of their product offering. And we will continue to build out the team, because that’s essential. At the end of the day, we are building a software platform, and selling it to biotech and pharma customers will definitely be an interesting journey.
So, I’m looking forward to bringing people not only in technical roles, like full stack, and UI, UX, and mathematics, but also, people in sales and marketing as well. And I think we are looking for investors, and we’re also looking for people to join us in this journey, who can help us build this platform.
David Yakobovitch
That’s exciting. And if we take a step back and look at a longer term roadmap, every founder has a great vision of where they see the industry, where they see their product, and where they see change and disruption. Can you paint a picture, if we’re looking 5 to 10 years down the road in drug discovery and development, of where you would like to see the industry grow? And how will that picture change?
Steven Banerjee
First and foremost, of course, I would personally love to see more and more drugs getting approved and the failure rate decrease, because, as I said earlier, it’s a massive societal cost, the way biotech R&D occurs, with more than 96% of drugs failing. And I was reading that even the top 10 to 15 selling drugs actually only work on 30 to 50% of the patients at most. So we’re failing almost 70% of the patients.
And I cannot imagine another industry where product failure is this rampant. So one of my visions, and one of our goals here, is to make a lot of this information and knowledge that is buried within all these siloed data sources accessible to an average person in this industry, and accessible without writing a single line of code, so that you can search, discover, generate hypotheses, test and validate them, share that within your enterprise, and really make a massive impact on the way the organization performs. We are looking for organizational transformation in the way biotech R&D occurs and the way decisions are made. And, as I said earlier, we see ourselves as a digital infrastructure for biomedical R&D and decision making.
And it’s a part of our mission statement: our mission is to do for biomedicine what Microsoft’s OS did for personal computing. That is, to really empower scientists, to empower an average person in this industry, to ask and answer all these complex questions without mastering coding languages or statistics, and to be able to get to that knowledge, share that knowledge, and really transform biotech R&D as a whole.
David Yakobovitch
I’m excited. I think that today, we’re learning a lot about the journey of applying AI, applied ML, and data to the industry. So there’s a lot of potential there, and a long, exciting road ahead. What message would you like to share with our audience as a takeaway today, Steven?
Steven Banerjee
We are definitely looking for early adopters. This includes biotech companies, pharma, and academic research labs that would like to test out Sapiens and would like it to be a part of their biomedical R&D journey. We’re definitely, as I said, looking for investors who would like to partner with us as we continue on this journey of building probably one of the most sophisticated natural language-based platforms, or as we call it, an explainable AI platform.
And really, my message to the audience is: watch out for us. Our goal is to really transform the way biotech R&D occurs. And that’s my personal mission. I founded my first company with that vision, and I have been continuing on this journey with NextNet to be able to do that. And I feel very grateful and lucky to be doing all of this with a team of really brilliant people. So that’s my message.
David Yakobovitch
Excellent. Well, this episode has been with Steven Banerjee, the founder and CEO of NextNet. Steven, thanks so much for joining us on the show.
Steven Banerjee
Thanks David.
David Yakobovitch
Thank you for listening to this episode of the HumAIn podcast. Did the episode measure up to your thoughts on ML, AI, data science, developer tools, and technical education? Share your thoughts with me at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe, and leave a review, and listen for more episodes of HumAIn.