Secure RAG Systems: A DeepTech Exploration with Protecto’s COO, Protik Mukhopadhyay

Protik Mukhopadhyay is the Chief Operating Officer (COO) at Protecto.ai, a venture-backed company specializing in secure and privacy-focused Retrieval-Augmented Generation (RAG) solutions. With over 15 years of experience in artificial intelligence, large language models, and data privacy, Protik is a seasoned entrepreneur and thought leader in the AI industry.

OUTLINE:

3:03 RAG Systems Key Dimensions

5:46 RAG Implementation Challenges

8:31 Effective RAG Use Cases

11:16 AI Ethics in RAG

14:01 Protecto’s Data Protection Approach

17:31 RAG Development Lessons Learned

20:16 On-premise vs. SaaS Deployment

22:46 Role-based Access in RAG

Episode Links:

Protecto AI: https://www.protecto.ai

Whitepaper: https://www.protecto.ai/trustworthy-ai-whitepaper

Sign up for a GenAI Strategy Roadmap Session: https://aistrategynow.com/

Protik Mukhopadhyay’s LinkedIn: https://www.linkedin.com/in/protikm/

Protik Mukhopadhyay’s Twitter: https://twitter.com/protik_m

PODCAST INFO:

Podcast website: https://www.humainpodcast.com

Apple Podcasts: https://apple.co/4cCF6PZ

Spotify: https://spoti.fi/2SsKHzg

RSS: https://feeds.redcircle.com/99113f24-2bd1-4332-8cd0-32e0556c8bc9

Full episodes playlist: https://www.humainpodcast.com/episodes/

SOCIAL:

– Twitter: https://x.com/dyakobovitch

– LinkedIn: https://www.linkedin.com/in/davidyakobovitch/

– Events: https://lu.ma/tpn

– Newsletter: https://bit.ly/3XbGZyy

Transcript:
David: Today, we’re joined by Protik Mukhopadhyay, COO of Protecto.AI. Protik, it’s been a nonstop, fast-paced year in all things AI, especially in what your company’s building in the space of RAGs. Could you share with our listeners a little about the startup and explain what retrieval augmented generation is and why it’s increasingly significant for IT and business environments?

Protik: Absolutely. There are common patterns for how to develop applications. When we were developing web applications, the standard pattern was MVC – model-view-controller. As companies adopt generative AI and want to include their own proprietary information in their OpenAI integrations, the pattern emerging is called RAG, or retrieval-augmented generation.

RAG allows you to augment generated content with your own knowledge base at the time of retrieval. It’s essentially an application development pattern for generative AI applications that allows companies to bring their knowledge and proprietary data, marry that with generative AI, and allow their customers, employees, and partners to be more productive.
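To make the pattern concrete, here’s a minimal sketch of a RAG flow in Python. The knowledge base and the keyword-overlap retriever are illustrative stand-ins, not Protecto’s implementation; a production system would use embeddings, a vector database, and a real LLM call.

```python
# Minimal RAG sketch: retrieve relevant snippets from a local knowledge
# base, then augment the prompt before it is sent to a language model.
# The corpus and scoring below are placeholders for illustration only.

KNOWLEDGE_BASE = [
    "Employees accrue 20 vacation days per year.",
    "Expense reports must be filed within 30 days.",
    "The VPN must be used on all public networks.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap (a real system would use
    vector embeddings and a vector database)."""
    q_words = set(query.lower().split())
    return sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str) -> str:
    """Augment the user query with retrieved context at query time."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many vacation days do employees get?"))
```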

There are ways to build RAG based on your preferred platform, whether it’s Microsoft, Google, or Amazon. What Protecto does is let you build for very specific sensitive-data use cases. If you’re a healthcare or financial services company, we focus on letting you use sensitive data in your RAG pipeline, ensuring anything confidential or regulated complies with rules like HIPAA, GDPR, and CCPA, or newer frameworks like the EU AI Act and the NIST AI Risk Management Framework.

David: It’s so important. I think you and I have seen such an evolution in data and AI systems in the last few years, with RAG being one of the newest evolutions. Your company, Protecto, offers a secure RAG solution. Could you walk us through some of the key technical architectural patterns in RAG AI systems and how they address security concerns?

Protik: I’ll talk about RAG in general, and Protecto is essentially one of the guardrails. There are complex strategies when it comes to building RAG, including chunking strategy and the choice of vector database. Think about RAG in four dimensions: data, latency, throughput, and accuracy.

For data, where Protecto focuses, we handle sensitive data privacy. We also address bias and fairness by replacing sensitive information with synthetic versions, so your systems aren’t operating on the actual data but on synthetic stand-ins for it.

Other challenges with RAG include retrieval time. Latency is crucial: you have to guarantee a certain response time when retrieving relevant documents from a data store. Throughput is also a concern because CPUs and GPUs are expensive, so we’re seeing a growing focus on those costs.

The most important aspect is the relevance of your response from the RAG system. Is it based on your current corporate policy? The relevance and accuracy of your retrieval are crucial. This allows companies to trust AI, improve adoption, and see AI as something that can make them more productive.

David: Thinking through those requirements, I imagine organizations might face challenges when implementing RAG systems. Can you tell us more about these challenges?

Protik: In our experience, it starts with data quality and consistency. Even pre-generative AI, maintaining high-quality data across sources was a challenge. Data quality is still one of the biggest challenges and opportunities for realizing the full value of RAG through generative AI.

We’ve seen many successful POCs, but post-production projects often lose steam or get canceled. You have to look at edge cases and scenarios. The business value you’re trying to realize through RAG applications should consider edge cases.

Hallucinations in generative AI are sometimes considered a feature, not a bug. How do you ensure that as you bring corporate data into your RAG pipeline, you’re focused on handling edge cases? There are data observability tools that measure model drift, so monitoring and managing model drift in your RAG pipeline is important.

Lastly, operational cost is crucial. For every project, there’s an ROI, but ongoing costs related to infrastructure, storage, and compute can get out of proportion. Sizing your environment, figuring out what kind of RAG pipeline to use, and understanding API nuances can go a long way in ensuring project success.

David: Of course, there are probably some specific use cases where RAG has been highly effective. Are there areas you’ve seen, like in customer support or legal compliance, with strong use cases?

Protik: Absolutely. Let’s start with creative writing. RAG systems can be really powerful for creating custom themes and genre-specific topics, especially for marketing departments. You can take all your prior IP and content, maybe your new brand positioning, and leverage that in a RAG pipeline to build new content faster and more accurately, following your brand guidelines.

Contract drafting for legal has become a common use case. Paralegals are now more productive because they can leverage corporate clauses, legal precedents, and newer legislation to create draft contracts and respond to legal letters.

For help desks, imagine a large company with thousands of technical support staff. How can I make each agent more productive? They already have access to their CRM system and knowledge base, but can I contextualize that and convert a user query into a prompt that gets them to a resolution faster? This can reduce ticket volume meaningfully. I’d say the help desk has been the biggest winner for RAG in enterprises so far.

This can also help with product innovation. Information from support cases can guide product teams in authoring their roadmap.

David: Your company emphasizes the importance of trustworthy AI. With that in mind, what are some key principles that organizations should follow to ensure their AI systems are ethical and responsible?

Protik: I think we should approach this more broadly – why limit it just to AI systems? In general, you should always treat ethics, privacy, and security as a design-first concern for any application.

The reason ethics becomes a challenge with AI is that the data you have may be inherently biased because it doesn’t include every relevant dataset. You have to look at the applications you’re developing and treat the output of your RAG systems as just one data point.

Ultimately, think about the response from your generative AI systems as a recommendation. You should never use it as your final authority. The decision of what action to take and how to ethically conduct yourself should still be with end stakeholders.

If you treat RAG systems as decision support or recommendation systems, that’s fine. The challenge is that even when disclaimers are provided, people often miss them. It’s very important to educate people that AI systems work on the data they’re trained on, and if that data is incomplete, there’s a good chance the output is biased or limited.

Based on your company policies, personal code of conduct, and local regulations, you have to use your best judgment to take action. Recommendations and actions are two different things, and confusion often happens when people mix them up.

David: Applying these thoughts on ethics, I suppose that Protecto’s solution helps companies with compliance on data protection regulations, especially when dealing with sensitive information like healthcare data. Could you unpack that for our audience?

Protik: Absolutely. Responsible AI is a broad umbrella. Where Protecto comes in is that we help you find what’s sensitive in your datasets – it could be PII or PHI – and then mask it with synthetic data, because we don’t want the data to lose context.

When an LLM is trained or queried, it still sees a name and a phone number, so the quality of the retrieval is preserved. Data masking technologies have existed for 20 years, but what makes Protecto different is that after detection, we also maintain the referential integrity of that data.

If the word “David” appears 10 times in my support tickets, I’m going to use the same token every time, not inconsistent tokens. Masking with different tokens would still satisfy privacy compliance laws, but it would leave the LLM thoroughly confused.

We replace sensitive data with synthetic data rather than random text so that the format is preserved. If the LLM is expecting a name, it gets a name; if it’s expecting a phone number, it’ll get a 10-digit phone number.

We put this in a secure vault and maintain a relationship between your synthetic version and the actual version. The LLM or RAG pipeline never gets what’s sensitive, but based on your access level and how you define your roles, you’re still able to retrieve the original data as well as synthetic data.

This allows for very granular applications. Your HR system may consider emails not sensitive and choose not to mask them, while your support system might want to mask all client emails. We enable that flexibility.
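As a rough illustration of the consistent-token idea, here’s a sketch of format-preserving masking with a reverse vault. The detection step, the synthetic-name pool, and the vault structure are hypothetical stand-ins for illustration; Protecto’s actual implementation is not shown.

```python
import random
import string

# Sketch of consistent, format-preserving masking with a reverse vault.
# The name pool and vault below are hypothetical illustrations only.

FAKE_NAMES = ["Alex Chen", "Priya Rao", "Sam Ortiz"]

class MaskingVault:
    def __init__(self) -> None:
        self.forward: dict[str, str] = {}  # real value -> synthetic token
        self.reverse: dict[str, str] = {}  # synthetic token -> real value

    def mask_name(self, name: str) -> str:
        """Return the same synthetic name every time the same real name
        appears, preserving referential integrity across documents."""
        if name not in self.forward:
            token = FAKE_NAMES[len(self.reverse) % len(FAKE_NAMES)]
            self.forward[name] = token
            self.reverse[token] = name
        return self.forward[name]

    def mask_phone(self, phone: str) -> str:
        """Replace a phone number with a synthetic 10-digit number, so
        the LLM still sees the format it expects."""
        if phone not in self.forward:
            token = "".join(random.choices(string.digits, k=10))
            self.forward[phone] = token
            self.reverse[token] = phone
        return self.forward[phone]

    def unmask(self, token: str) -> str:
        """Authorized, role-gated callers can recover the original."""
        return self.reverse.get(token, token)

vault = MaskingVault()
ticket = "David called. David's number is 5551234567."
masked = (ticket
          .replace("David", vault.mask_name("David"))
          .replace("5551234567", vault.mask_phone("5551234567")))
print(masked)  # the same synthetic name appears in both places
```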

David: That’s great to hear about the mission and focus on data security. I imagine that working with clients and design partners has given you some lessons learned in developing robust RAG AI systems in production and deployment. Could you share some of those lessons or hurdles in moving from proof of concept to full-scale implementations?

Protik: Sure, David. It goes back to what I was talking about earlier: data quality, handling edge cases, and operational costs. These are very important dimensions to look at.

In the euphoria of generative AI over the last year, we’re seeing a lot of POCs getting abandoned for three main reasons:

  1. The business value promised by generative AI can’t be realized simply by enabling a co-pilot. You need an adoption plan. If you don’t develop a plan for how you’re going to roll out, educate, and allow your end users to be successful, that’s a recipe for disaster.
  2. The messaging and branding of these tools internally is very important. It allows people to see it from a business value lens and work backwards on the adoption plan.
  3. If you don’t have data quality, don’t take up these projects. Make sure you have consistent data quality in place first.

There are important tools for managing ongoing operational costs, even down to cost per query or cost per prompt. This will enable you to message correctly and prevent your CIO from shutting down your project due to unexpected consumption spikes.
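For a sense of what tracking cost per query can look like, here’s a back-of-the-envelope sketch; the per-token prices are made-up placeholders, not any provider’s actual rates.

```python
# Back-of-the-envelope cost-per-query tracking for a RAG pipeline.
# The per-token prices are hypothetical placeholders; substitute your
# provider's actual rates and your own token counts.

PRICE_PER_1K_INPUT = 0.0005   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1,000 output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one query from its token counts. RAG
    prompts are large because retrieved context is prepended."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(f"${query_cost(input_tokens=3000, output_tokens=400):.4f} per query")
```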

Finally, make sure you do a CISO review early in your project. Ensure that from a privacy, security, and governance perspective, if the CISO has given a go-ahead or called out risks, you handle those as part of your implementation.

David: Protecto offers both on-premise and SaaS options. How do you help companies decide which deployment model is best for their needs, especially when dealing with sensitive data?

Protik: In all openness, 90% of our production deployments are on-premise. This is because we focus on US-based financial services and healthcare clients, which are heavily regulated. They prefer a tool like Protecto, which promises data protection, governance, and AI guardrails, to be deployed behind their firewall in their VPC.

We typically give them a container for on-premise deployment. The SaaS offering is usually a way for us to solve a small use case or for potential clients to sign up for a trial on our website. You can look at how accurate our detection algorithm is and how we manage sensitive data.

For the vast majority of clients, including a recent client where we’re deployed across 3,000 customers, it’s essentially behind the firewall in their VPC as a container. This gives us the flexibility to monitor based on their data volumes, add more CPU power, and apply application-level security on top of it.

We see almost an 80-20 split between on-premise and SaaS. When we say on-premise, it’s still on the cloud but in the client’s environment, in their VPC behind the firewall, typically on GCP, AWS, or whatever the client’s preferred cloud partner is.

David: Your solution at Protecto incorporates RBAC (Role-Based Access Control) into the RAG process. Could you explain how this enhances security and privacy in AI applications?

Protik: Typically, RBAC is defined at the organization level. Protecto doesn’t define RBAC; we inherit it in our API. Let me give you an example:

Let’s say your HR team has rolled out a ChatGPT-like application for HR policies, and an employee in Singapore types “Is Monday a holiday?” As a global company with employees in the US, Europe, India, and Singapore, we would look at the attributes attached to their role in the RBAC application.

We’d construct the actual prompt as: “As a resident of Singapore who is a full-time employee of this company, based on the Singapore corporate policy, which exists in the Singapore folder, can you retrieve the right information and get back to me?”

We inherit the role an employee has, tie it up in our system, and ensure that when we’re retrieving information, it’s based on their role. It’s like getting response-based access – based on your role, you’ll get the response from your RAG system that’s relevant to you.

This ensures you’re not retrieving corporate policy that belongs to someone in the US, for example. While there might be no harm in that, it creates confusion. Now think about use cases where there’s sensitive data, like a finance controller talking about forecasted data. We limit access and only show what’s relevant based on the RBAC that has been defined.
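Here’s a simplified sketch of the role-aware retrieval Protik describes, with a hypothetical employee record and folder layout; in practice the role attributes would be inherited from the organization’s RBAC system through Protecto’s API rather than defined in code like this.

```python
from dataclasses import dataclass, field

# Sketch of role-aware prompt construction: attributes inherited from
# the organization's RBAC system scope what retrieval is allowed to see.
# The Employee record and folder layout are hypothetical illustrations.

@dataclass
class Employee:
    name: str
    country: str
    employment_type: str
    roles: set[str] = field(default_factory=set)

def build_scoped_prompt(user: Employee, query: str) -> str:
    """Rewrite the raw query so retrieval is limited to the policy
    documents the user's location and role entitle them to see."""
    return (
        f"As a resident of {user.country} who is a {user.employment_type} "
        f"employee of this company, based on the {user.country} corporate "
        f"policy folder, answer: {query}"
    )

def allowed_folders(user: Employee) -> list[str]:
    """Map RBAC attributes to the folders retrieval may search."""
    folders = [f"policies/{user.country.lower()}"]
    if "finance_controller" in user.roles:
        folders.append("finance/forecasts")  # sensitive, role-gated
    return folders

emp = Employee("Mei", "Singapore", "full-time", {"employee"})
print(build_scoped_prompt(emp, "Is Monday a holiday?"))
print(allowed_folders(emp))
```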

David: Looking ahead, what developments or trends do you foresee in the field of secure and privacy-focused AI, particularly for enterprise applications, in the next one to three years?

Protik: It’s a difficult question because one to three years in software could be equivalent to 10 to 30 years given the pace of innovation. Today, I see a lot of challenges and great scope for improvement in relevance and focus on explainability.

I think RAG systems will mature significantly in the next three to six months. People are using various approaches, like combining graphs with RAG or using different ways to retrieve information. Accuracy is one dimension, but the relevance of that accuracy, measuring it, how you do the ranking, and how you continuously serve the right documents are interesting problems.

These problems might be solved in six to nine months rather than three years. Today, the retrieval and generation pieces are fairly easy to solve, and as elements become commoditized, with options spanning open-source models like Llama 3 and proprietary ones like ChatGPT, the quality of responses will be good.

We want to ensure that relevance is retained over time and that we have metrics – almost like a relevance score – that are specific to each retrieval. This will give people more confidence.

If you add explainability and allow people to gain more insight into why a response is coming in a certain way, I think that will go a long way. Some tools are already doing this in demos, but I think it becoming mainstream and almost a mandatory requirement will really help people adopt and trust AI.

David: Well, I’m really excited about the technology you and your company are building and all the evolution we’re seeing in the RAG space. It’s such an emerging field. Protik Mukhopadhyay, thanks so much for sharing with us and the listeners at HumAIn about all that you do.

Protik: Thank you very much, I appreciate the time. Along with Protecto, we also have a community called AI Ignite Studio where we connect product companies and founders. If anyone is interested, especially data practitioners and Chief Data Officers who want to connect with innovative companies, we’d love to share more insights. It’s a nonprofit, free for any practitioner to join, and I’m very happy that I’m getting to build that alongside Protecto.

David: Beautiful. We always love causes for the community and paying it forward with a nonprofit cause. Look forward to us all checking it out. It’s been fantastic having you on the show today.

Protik: Thank you, David. Really appreciate the time. Have a good weekend.