Unleash the Power of the Intel® Gaudi® 3 AI Accelerators for a State-of-the-Art Generative AI Pipeline

Enterprises are increasingly using large language models (LLMs) for sophisticated generative AI (GenAI) needs, particularly conversational applications. The retrieval augmented generation (RAG) architecture integrates the strengths of knowledge bases, accessed through vector storage, with the creative potential of generative models. As RAG techniques advance, sophisticated document processing techniques for vector database ingestion, such as semantic segmentation and hierarchical indexing, further refine the quality of generated outcomes. Deploying the application across multiple users, user personas, and languages poses another challenge as AI pipelines are built.

Introducing a Scalable AI Revolution with the Intel Gaudi 3 AI Accelerator

In this blog, we delve into the capabilities of the Intel Gaudi 3 AI accelerator and the Intel® Xeon® platform and demonstrate how they address enterprise GenAI requirements by taking advantage of LLM and RAG technologies. As demonstrated in this video, we implemented a reference design using out-of-the-box, open source components enabled on Intel Gaudi 3 AI accelerators through their PyTorch*-based software stack. Here are a few of the features the reference design can achieve on the Intel Gaudi 3 AI accelerator:

  • Solution scalability
  • Advanced RAG functionalities: semantic chunking and hierarchical indexing
  • Multipersona role-specific responses
  • Multilingual support

Showcasing Cluster Scalability and Robust Performance

Consistent response latency under significant user load demonstrates the scalability, robustness, and performance of the reference design pipeline. The reference design dynamically scales the number of query generation pods as the rate of incoming queries increases. This dynamic scaling and load balancing enables the cluster to handle high query demand with ease, maintaining exceptional throughput during peak use. As use decreases, the reference design scales down the number of query generation pods, releasing unneeded resources and reducing the cost of operating the pipeline.
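To illustrate what this elasticity could look like in practice, here is a minimal sketch that registers a Kubernetes HorizontalPodAutoscaler for the query generation pods using the official Kubernetes Python client. The deployment name, namespace, metric, and thresholds are illustrative assumptions, not values taken from the reference design, which scales on query rate.

```python
# Illustrative sketch: autoscale the query-generation pods with the
# official Kubernetes Python client. The deployment name, namespace,
# and thresholds are assumptions, not reference design values. CPU
# utilization stands in here as a simple built-in metric; the reference
# design scales on the rate of incoming queries.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="query-generation-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="query-generation"
        ),
        min_replicas=1,   # release resources when demand drops
        max_replicas=8,   # cap the fleet during peak load
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(
                    type="Utilization", average_utilization=70
                ),
            ),
        )],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```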

Advanced RAG Functionality: Semantic Chunking and Hierarchical Indexing

The reference design demonstrates the advanced RAG functionality achievable on the Intel Gaudi 3 AI accelerator by implementing two advanced RAG features: semantic chunking and hierarchical indexing.

Semantic Chunking

This method aims to create more meaningful and context-aware text segments when the text in documents is parsed. Text is split at more natural breakpoints, preserving semantic coherence within each chunk.
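A minimal sketch of the idea, assuming the sentence-transformers library with an illustrative model and threshold: embed consecutive sentences and start a new chunk wherever the cosine distance between neighbors jumps, signaling a topic shift.

```python
# Semantic-chunking sketch using sentence embeddings. The model name,
# sentence splitter, and threshold are illustrative choices, not the
# components used in the reference design.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.35) -> list[str]:
    # Naive sentence splitter; a production pipeline would use a real tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between consecutive sentences; a large jump
        # marks a natural breakpoint where semantic coherence drops.
        distance = 1.0 - float(np.dot(emb[i - 1], emb[i]))
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```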

Hierarchical Indexing

This RAG technique uses two levels of encoding: document-level summaries and detailed chunks. This method improves the quality of information retrieval by first identifying relevant document sections through summaries and then drilling down to specific details within those sections. It first searches the summary vector store to identify relevant document sections. For each relevant summary, it then searches the detailed chunk vector store to retrieve only the most relevant document sections.
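The two-stage search might look like the following sketch over in-memory NumPy vectors; the store layout, parameter names, and top-k values are illustrative stand-ins for the reference design's actual vector database and embedding model.

```python
# Two-stage hierarchical retrieval sketch over in-memory vectors.
# All names here are hypothetical stand-ins for the reference design's
# actual summary and chunk vector stores.
import numpy as np

def top_k(query_vec, vectors, k):
    # Cosine similarity, assuming all vectors are pre-normalized.
    scores = vectors @ query_vec
    return np.argsort(scores)[::-1][:k]

def hierarchical_search(query_vec, summary_vecs, summary_doc_ids,
                        chunk_vecs, chunk_doc_ids, chunks,
                        k_docs=3, k_chunks=4):
    # Stage 1: find the most relevant documents via their summaries.
    relevant = {summary_doc_ids[i] for i in top_k(query_vec, summary_vecs, k_docs)}
    # Stage 2: search detailed chunks, restricted to those documents.
    mask = np.array([doc in relevant for doc in chunk_doc_ids])
    candidates = np.where(mask)[0]
    best = top_k(query_vec, chunk_vecs[candidates], k_chunks)
    return [chunks[candidates[i]] for i in best]
```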

Customized AI Responses: A User-Centric Approach

One of the most novel features of the reference design is its ability to generate responses based on different user roles. Every request is secured with a signed JSON Web Token (JWT). The pipeline parses the JWT, verifies its signature and claims, and obtains the user's permissions. Based on those permissions and roles, users can query only the vector database collections they are authorized to access. By using the vector database's metadata capabilities and annotating ingested documents with the user roles they target, the pipeline provides tailored responses based on the user's role, whether they are a developer, a DevOps or site reliability engineer (SRE), or a marketing professional. This personalized interaction is elevated by the pipeline's semantic chunking and hierarchical indexing of documents, ensuring responses are not only precise but also contextually enriched.

The reference design can support multiple user roles, and documents can be annotated with one or more roles of interest.
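A hedged sketch of the token verification and role-gating step, using the PyJWT library; the secret, claim names, and role-to-collection mapping are assumptions for illustration, not the reference design's actual scheme.

```python
# Sketch of per-role retrieval gating with PyJWT. The secret, the
# "roles" claim, and the role-to-collection map are illustrative
# assumptions.
import jwt  # PyJWT

ROLE_COLLECTIONS = {              # hypothetical role -> collection map
    "developer": "docs_developer",
    "sre": "docs_sre",
    "marketing": "docs_marketing",
}

def collections_for_request(token: str, secret: str) -> list[str]:
    # Verifies the signature and standard claims; raises on tampering or expiry.
    claims = jwt.decode(token, secret, algorithms=["HS256"])
    roles = claims.get("roles", [])
    # The query layer then searches only the collections these roles allow.
    return [ROLE_COLLECTIONS[r] for r in roles if r in ROLE_COLLECTIONS]
```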

Multilingual Queries: Bridging Language Gaps

The demonstration further highlights the pipeline's multilingual capabilities, an ability most enterprise GenAI implementations now demand. Queries in various languages are effortlessly processed, translated, and contextualized, showcasing the pipeline's ability to serve a diverse, global user base. Responses are returned in the language in which the questions were asked. The user doesn't have to select a language profile; the pipeline automatically detects supported languages and adjusts accordingly.
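The detection step can be as simple as the sketch below, which uses the langdetect library; the library choice is an assumption, not necessarily what the reference design uses.

```python
# Language-detection sketch with langdetect (an illustrative choice).
from langdetect import detect

def detect_language(query: str) -> str:
    # Returns an ISO 639-1 code such as "en", "de", or "es"; the pipeline
    # can use it to answer in the same language the question was asked in.
    return detect(query)
```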

More impressively, the ingested documents used to generate a response can be in a language different from the query. This multilingual support breaks down language barriers and enhances the accessibility of AI technologies worldwide.
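One way to achieve this, sketched below under the assumption of a multilingual embedding model from sentence-transformers, is to map queries and documents into a single shared vector space so that, for example, a German query can retrieve an English document. The model name and sample data are illustrative.

```python
# Cross-lingual retrieval sketch: a multilingual embedding model places
# queries and documents in one vector space, regardless of language.
# The model name and sample text are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
docs = ["Gaudi 3 accelerators support large-scale LLM inference."]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "Unterstützen Gaudi-3-Beschleuniger LLM-Inferenz?"  # German query
q_vec = model.encode([query], normalize_embeddings=True)[0]
best = (doc_vecs @ q_vec).argmax()  # cosine similarity on normalized vectors
print(docs[best])                   # retrieves the English document
```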

AI Software Catalog Meets the Intel Gaudi 3 AI Accelerator: A Seamless Integration

Enterprises are adopting LLMs with the RAG architecture to enhance conversational AI, combining knowledge bases with generative models for improved outcomes. Advanced document processing methods like semantic segmentation and hierarchical indexing are key to refining RAG outputs, and AI pipelines must accommodate diverse users and languages. The reference design shows that these open source building blocks run out of the box on the Intel Gaudi 3 AI accelerator.

Conclusion: Pioneering the Future of AI with the Intel Gaudi 3 AI Accelerator and Open Source

The integration of open source software with the Intel Gaudi 3 AI accelerator sets a new standard for the deployment and scalability of GenAI pipelines. This powerful synergy ensures reliability and positions the platform at the forefront of AI innovation. Thank you for joining us on this transformative journey.