Get 10x more efficient at building production-ready GenAI RAG applications
Large Language Models (LLMs) like OpenAI’s ChatGPT have been around for some time now and have shown they are great for automating many tasks like content generation, text processing, information retrieval, etc. The only things limiting us at the moment are our creativity and the context window. 😉
Context window?
Yes, LLMs are frozen in time. At the time of writing this post, ChatGPT’s knowledge cutoff is September 2021. Because your data is evolving and was never seen by the LLM during training, it can’t interpret your proprietary data. If you ask it something about your internal or real-time data, which the LLM has never seen, it will make up an answer, also known as hallucinating.
Retrieval Augmented Generation (RAG) pipelines fetch relevant context related to the user query so that LLMs can generate their answer based on the context, thereby generating grounded contextual responses.
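To make the flow concrete, here is a minimal RAG sketch in Python. It assumes the OpenAI Python SDK and a hypothetical `vector_store.search()` helper standing in for whatever retrieval layer you use; the model name and prompt wording are illustrative only.

```python
# Minimal RAG sketch: retrieve context, then ask the LLM to answer from it.
# Assumes the OpenAI Python SDK; `vector_store` is a hypothetical retrieval
# helper standing in for your actual vector database client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_rag(question: str, vector_store) -> str:
    # 1. Fetch context relevant to the user query.
    chunks = vector_store.search(question, top_k=3)  # hypothetical helper
    context = "\n\n".join(chunks)

    # 2. Ground the LLM's answer in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```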
Unfortunately, we are limited by every LLM’s context window length, which requires us to be selective about the information we add to the prompt. In some cases, we can’t cover all of the necessary context for an LLM to answer our queries.
For example, consider the case of summarizing or answering queries on your application code from GitHub. You can’t understand all the details from just a few functions or pages. Likewise, LLMs can’t answer accurately with limited context in all cases.
Bullish on context window size increase!
Much like Moore’s law, the GenAI industry keeps striving for larger context windows. The context window length of OpenAI’s GPT models has roughly doubled every 3-4 months between releases.
Anthropic also released a 100k context window Claude model, which can process 240 pages of API docs in a single prompt.
But bigger is not always better. Recent research shows that longer contexts can actually degrade performance, with models struggling to use information buried in the middle of the prompt. You can read more about this fantastic research in the "Lost in the Middle" paper.
So, how much context is good context?
Building production GenAI applications is challenging.
We’re still in the early days of GenAI, with the technology quickly evolving.
While context length is critical to the quality of application results, it is only one of many decisions AI engineers have to make, and they have to keep revisiting these questions as the tech advances:
- What is the best vector DB in terms of cost and performance?
- What is the best chunking strategy?
- How do we guard against malicious attacks across different LLM data query patterns?
- What is the best model in terms of cost and performance?
- And then there are the issues around reliably serving the app itself: how do you debug when something goes wrong? How do you monitor the model’s performance?
And many more.
Spending too much time outside of building the core GenAI product.
Let us review a slightly abstracted tech stack of LLM applications.
The first layer, applications, is what you are building for your customers, e.g., a chatbot that resolves issues with online orders. The second, business logic, determines the execution logic, e.g., how the chatbot responds to incomplete-order complaints. The third and fourth layers are the foundational layers: context-augmented LLMs and data sources.
Given the fast-paced environment, it's equally important to avoid dedicating excessive time to constructing data APIs and setting up LLMOps (crucial yet somewhat mundane tasks 🤷♀️) when you could be focused on creating exciting and lucrative applications. 🚀
Boost your productivity with Hasura and Portkey ✨
Let us understand how Hasura and Portkey can boost your productivity and help you stay competitive.
Understanding Hasura
Data-driven applications have always required secure data access. GenAI applications bring in new data query challenges:
- Time-consuming: Building data APIs for multiple data sources with authentication and authorization, like role-based access control, is time-consuming, and hand-rolled APIs are rarely optimized for performance, scalability, or reliability.
- Security is not easy: Securely querying vector databases is challenging. Vector databases are not connected to the application data model, making it difficult to query vectors for specific entities or apply access permissions.
- Integration is complex: Vector stores hold additional metadata that must be kept in sync between the application database and the vector database. There is a high risk of the vector database going out of sync, leading to stale results.
Hasura enables you to build production-grade data APIs in no time, unifying your multiple private data sources to give you secure, unified query capabilities with authentication and role-based access control.
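As a small illustration, here is a sketch of querying a Hasura GraphQL endpoint from Python with role-based session variables. The endpoint, table, role, and admin-secret setup are hypothetical; in production the role and user id would typically come from a JWT or auth webhook rather than raw headers.

```python
# Sketch: querying Hasura with role-based access control. Hasura's permission
# rules for the "customer" role decide which rows and columns come back.
import requests

HASURA_URL = "https://your-project.hasura.app/v1/graphql"  # hypothetical

query = """
query RecentOrders {
  orders(order_by: {created_at: desc}, limit: 5) {
    id
    status
  }
}
"""

resp = requests.post(
    HASURA_URL,
    json={"query": query},
    headers={
        # Admin secret plus explicit session variables, handy for testing;
        # real clients would send a JWT that carries these claims instead.
        "x-hasura-admin-secret": "<HASURA_ADMIN_SECRET>",
        "x-hasura-role": "customer",
        "x-hasura-user-id": "42",
    },
    timeout=30,
)
print(resp.json()["data"])
```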
Hasura also helps you keep your data stores in sync: Event and Scheduled Triggers can automatically vectorize data whenever a CRUD event happens on your data.
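For example, here is a sketch of the webhook an Event Trigger could call on inserts and updates to re-embed the changed row. The column names, embedding model, and `upsert_vector()` stub are hypothetical placeholders for your own schema and vector database client.

```python
# Sketch: webhook for a Hasura Event Trigger that re-vectorizes a row on
# INSERT/UPDATE. Column names, the embedding model, and upsert_vector()
# are hypothetical placeholders.
from fastapi import FastAPI, Request
from openai import OpenAI

app = FastAPI()
openai_client = OpenAI()


def upsert_vector(row_id: int, vector: list[float]) -> None:
    """Hypothetical: write the embedding to your vector database."""
    ...


@app.post("/vectorize")
async def vectorize(request: Request):
    payload = await request.json()
    op = payload["event"]["op"]            # "INSERT", "UPDATE", or "DELETE"
    row = payload["event"]["data"]["new"]  # row state after the change

    if op in ("INSERT", "UPDATE") and row:
        embedding = openai_client.embeddings.create(
            model="text-embedding-3-small",   # illustrative model choice
            input=row["content"],             # hypothetical text column
        ).data[0].embedding
        upsert_vector(row["id"], embedding)

    return {"ok": True}
```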
Build a secure LLM application API with Hasura.
Easily convert any business logic into a secure API using Hasura Actions, including your LLM queries.
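For instance, an Action such as a hypothetical `askAssistant(question)` mutation forwards its input and session variables to a webhook like the sketch below, where you run the LLM/RAG logic and return a typed result; the route and field names are assumptions for illustration.

```python
# Sketch: webhook backing a hypothetical askAssistant(question) Hasura Action.
# Hasura POSTs the action input and session variables; the handler runs the
# LLM/RAG logic and returns the result defined in the Action's output type.
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/ask-assistant")
async def ask_assistant(request: Request):
    payload = await request.json()
    question = payload["input"]["question"]                        # action input field
    user_id = payload["session_variables"].get("x-hasura-user-id")

    # Call your LLM / RAG pipeline here (e.g. the answer_with_rag sketch above).
    answer = f"(placeholder answer for user {user_id}: {question})"

    return {"answer": answer}
```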
Understanding Portkey
Portkey, in simple words, brings the DevOps principle of reliability to the LLM world: it adds production capabilities to your RAG app with almost no changes to your existing code.
And it does that by helping you answer four questions about your app:
- How can I see all my RAG calls, and identify which are failing?
- What is my latency and how can I reduce it?
- How can I stay compliant and keep my LLM calls secure?
- How can I ensure reliability when any upstream component fails?
Portkey essentially operates as a gateway between your app and your LLM provider, and makes your workflows production-grade with its observability and reliability layers.
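Because the gateway speaks the OpenAI API format, integration is usually just a matter of pointing your existing client at Portkey’s endpoint. A minimal sketch, assuming the OpenAI Python SDK; the header names follow Portkey’s docs at the time of writing, so verify them against the current documentation.

```python
# Sketch: route existing OpenAI calls through Portkey's gateway by swapping
# the base URL and adding Portkey headers (header names as documented by
# Portkey at the time of writing; check current docs).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.portkey.ai/v1",        # the only real change
    default_headers={
        "x-portkey-api-key": "<PORTKEY_API_KEY>",
        "x-portkey-provider": "openai",
    },
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello from behind the gateway!"}],
)
print(response.choices[0].message.content)
```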
Portkey provides:
Observability Layer
- Logging: Portkey logs all requests by default so you can see the details of a particular call and debug easily.
- Tracing: Trace related prompt chains under a single trace id and analyze them together.
- Custom Tags: Add production-critical tags to requests that you can segment and categorize later for deeper insights on how your app is being used (see the sketch after this list).
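For example, a trace id and custom metadata can ride along as request headers through the same gateway client; a sketch below, with the `x-portkey-trace-id` and `x-portkey-metadata` header names taken from Portkey’s docs at the time of writing (treat them as an assumption to verify).

```python
# Sketch: group a chain of calls under one trace id and tag the request with
# custom metadata for later segmentation in Portkey's dashboard.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.portkey.ai/v1",
    default_headers={
        "x-portkey-api-key": "<PORTKEY_API_KEY>",
        "x-portkey-provider": "openai",
    },
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize my last order."}],
    extra_headers={
        "x-portkey-trace-id": "order-chat-1234",  # ties related calls together
        "x-portkey-metadata": json.dumps({"env": "prod", "feature": "order-bot"}),
    },
)
```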
Reliability Layer
- Automated Fallbacks and Retries: Ensure your application remains functional even if a primary service fails (see the config sketch after this list).
- Load Balancing: Efficiently distribute incoming requests among multiple models to reduce latencies and stay within rate-limiting thresholds.
- Semantic Caching: Reduce costs and latency by intelligently caching requests (20% of requests can become 20x faster).
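To show how fallbacks, retries, and caching are expressed, here is a sketch of a Portkey config attached via the `x-portkey-config` header. The config keys follow Portkey’s published schema at the time of writing; treat them as an assumption and check the current docs before relying on them.

```python
# Sketch: a Portkey config enabling retries, semantic caching, and a fallback
# between two targets, sent as the x-portkey-config header (schema per
# Portkey's docs at the time of writing; verify before use).
import json
from openai import OpenAI

portkey_config = {
    "retry": {"attempts": 3},
    "cache": {"mode": "semantic"},
    "strategy": {"mode": "fallback"},
    "targets": [
        {"provider": "openai", "api_key": "<PRIMARY_OPENAI_KEY>"},
        {"provider": "openai", "api_key": "<BACKUP_OPENAI_KEY>"},
    ],
}

client = OpenAI(
    base_url="https://api.portkey.ai/v1",
    default_headers={
        "x-portkey-api-key": "<PORTKEY_API_KEY>",
        "x-portkey-config": json.dumps(portkey_config),
    },
)
```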
All of this, in your existing code, with just one line of change to integrate Portkey.
Are you excited about Hasura and Portkey? We would love to hear what you are building. Share your thoughts and feedback in the comments.