
Building a Bhagavad Gita AI Assistant


In the fast-evolving world of AI, large language models are pushing boundaries in speed, accuracy, and cost-efficiency. The recent launch of Deepseek R1, an open-source model rivaling OpenAI’s o1, is a hot topic in the AI space, especially given its 27x lower cost and strong reasoning capabilities. Paired with Qdrant’s binary quantization for efficient and fast vector search, we can index documents of 1,000+ pages. In this article, we’ll create a Bhagavad Gita AI Assistant, capable of indexing 1,000+ pages, answering complex queries in seconds using Groq, and delivering insights with domain-specific precision.

Learning Objectives

  • Implement binary quantization in Qdrant for memory-efficient vector indexing.
  • Understand how to build a Bhagavad Gita AI Assistant using Deepseek R1, Qdrant, and LlamaIndex for efficient text retrieval.
  • Learn to optimize the Bhagavad Gita AI Assistant with Groq for fast, domain-specific query responses and large-scale document indexing.
  • Build a RAG pipeline using LlamaIndex and FastEmbed local embeddings to process 1,000+ pages of the Bhagavad Gita.
  • Integrate Deepseek R1 via Groq inference for real-time, low-latency responses.
  • Develop a Streamlit UI to showcase AI-powered insights with transparent reasoning.

This article was published as a part of the Data Science Blogathon.

Deepseek R1 vs OpenAI o1

Deepseek R1 challenges OpenAI’s dominance with 27x lower API costs and near-par performance on reasoning benchmarks. Unlike OpenAI’s o1 closed, subscription-based model ($200/month), Deepseek R1 is free, open-source, and ideal for budget-conscious projects and experimentation.

Reasoning on the ARC-AGI Benchmark [Source: ARC-AGI]:

  • Deepseek: 20.5% accuracy (public), 15.8% (semi-private).
  • OpenAI: 21% accuracy (public), 18% (semi-private).

From my experience so far, Deepseek does a great job with math reasoning, coding-related use cases, and context-aware prompts. However, OpenAI retains an edge in general knowledge breadth, making it preferable for fact-diverse applications.

What is Binary Quantization in Vector Databases?

Binary quantization (BQ) is Qdrant’s index compression technique for optimizing high-dimensional vector storage and retrieval. By converting 32-bit floating-point vectors into 1-bit binary values, it slashes memory usage by up to 40x and dramatically accelerates search.

How It Works

  • Binarization: Vectors are simplified to 0s and 1s based on a threshold (e.g., values > 0 become 1).
  • Efficient Indexing: Qdrant’s HNSW algorithm uses these binary vectors for fast approximate nearest neighbor (ANN) search.
  • Oversampling: To balance speed and accuracy, BQ retrieves extra candidates (e.g., 200 for a limit of 100) and re-ranks them using the original vectors.

Why It Matters

  • Storage: A 1536-dimension OpenAI vector shrinks from 6 KB to 0.1875 KB.
  • Speed: Boolean operations on 1-bit vectors execute faster, reducing latency.
  • Scalability: Ideal for large datasets (1M+ vectors) with minimal recall tradeoffs.

Avoid binary quantization for low-dimensional vectors (<1024), where information loss significantly impacts accuracy. Traditional scalar quantization (e.g., uint8) may suit smaller embeddings better. The quick calculation below illustrates the storage figures.
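As a sanity check on the storage numbers above, here is a minimal back-of-the-envelope sketch in plain Python (no external dependencies). The raw vector payload shrinks 32x when 32-bit floats become single bits; the additional savings Qdrant reports come from keeping the original vectors on disk and only the compact quantized index in RAM.

DIMS = 1536  # e.g., OpenAI embedding dimensionality used in the example above

float32_kb = DIMS * 4 / 1024   # 4 bytes per dimension for 32-bit floats
binary_kb = DIMS / 8 / 1024    # 1 bit per dimension, packed into bytes

print(f"float32 vector:   {float32_kb:.4f} KB")              # 6.0000 KB
print(f"binarized vector: {binary_kb:.4f} KB")               # 0.1875 KB
print(f"compression:      {float32_kb / binary_kb:.0f}x")    # 32x on the raw vectors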

Building the Bhagavad Gita Assistant

Below is the flow chart that explains how we will build the Bhagavad Gita Assistant:

(Flow chart: building the Bhagavad Gita Assistant)

Architecture Overview

  • Data Ingestion: A 900-page Bhagavad Gita PDF split into text chunks.
  • Embedding: Qdrant FastEmbed’s text-to-vector embedding model.
  • Vector DB: Qdrant with BQ stores the embeddings, enabling millisecond searches.
  • LLM Inference: Deepseek R1 via Groq LPUs generates context-aware responses.
  • UI: Streamlit app with expandable “thinking process” visibility.

Step-by-Step Implementation

Let us now follow the steps one by one:

Step 1: Installation and Initial Setup

Let’s set up the foundation of our RAG pipeline using LlamaIndex. We need to install the essential packages, including the core LlamaIndex library, the Qdrant vector store integration, FastEmbed for embeddings, and Groq for LLM access.

Note:

  • For document indexing, we’ll use a GPU on Colab to store the data. This is a one-time process.
  • Once the data is stored, we can use the collection name to run inference anywhere, whether in VS Code, Streamlit, or other platforms.
!pip install llama-index
!pip install llama-index-vector-stores-qdrant llama-index-embeddings-fastembed
!pip install llama-index-readers-file
!pip install llama-index-llms-groq

Once the installation is done, let’s import the required modules.

import logging
import sys
import os

import qdrant_client
from qdrant_client import models

from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq  # Deepseek R1 is served via Groq

Step 2: Document Processing and Embedding

Here, we handle the crucial task of converting raw text into vector representations. SimpleDirectoryReader loads documents from a specified folder.

Create a folder, i.e., a data directory, and add all your documents inside it. In our case, we downloaded the Bhagavad Gita document and saved it in the data folder.

You can download the ~900-page Bhagavad Gita document here: iskconmangaluru

data = SimpleDirectoryReader("data").load_data()
texts = [doc.text for doc in data]

embeddings = []
BATCH_SIZE = 50

Qdrant’s FastEmbed is a lightweight, fast Python library designed for efficient embedding generation. It supports popular text models and uses quantized model weights together with the ONNX Runtime for inference, ensuring high performance without heavy dependencies.

To convert the text chunks into embeddings, we’ll use Qdrant’s FastEmbed. We process them in batches of 50 documents to manage memory efficiently.

embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")


for page in range(0, len(texts), BATCH_SIZE):
    page_content = texts[page:page + BATCH_SIZE]
    response = embed_model.get_text_embedding_batch(page_content)
    embeddings.extend(response)

Step 3: Qdrant Setup with Binary Quantization

Time to configure the Qdrant client, our vector database, with settings optimized for performance. We create a collection named “bhagavad-gita” with specific vector parameters and enable binary quantization for efficient storage and retrieval.

There are three ways to use the Qdrant client (the first two are sketched below; this article uses the cloud setup):

  • In-Memory Mode: Using location=":memory:", which creates a temporary instance that lives only for the current run.
  • Localhost: Using location="localhost", which requires a running Docker instance. You can follow the setup guide here: Qdrant Quickstart
  • Cloud Storage: Storing collections in the cloud. To do this, create a new cluster, provide a cluster name, and generate an API key. Copy the key and retrieve the URL from the curl command.

Note that the collection name needs to be unique; after every data change, it needs to be changed as well.
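Before moving to the cloud setup used in this article, here is a minimal sketch of the first two client modes. These lines are illustrative only and assume a Qdrant Docker container already running on the default port 6333.

# In-memory mode: a throwaway instance that lives only for the current Python process
temp_client = qdrant_client.QdrantClient(location=":memory:")

# Localhost mode: assumes `docker run -p 6333:6333 qdrant/qdrant` is already running
local_client = qdrant_client.QdrantClient(url="http://localhost:6333")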

collection_name = "bhagavad-gita"

client = qdrant_client.QdrantClient(
    # location=":memory:",
    url="QDRANT_URL",  # replace QDRANT_URL with your endpoint
    api_key="QDRANT_API_KEY",  # replace QDRANT_API_KEY with your API key
    prefer_grpc=True
)

We first check whether a collection with the specified collection_name exists in Qdrant. Only if it doesn’t do we create a new collection configured to store 1,024-dimensional vectors, using cosine similarity for distance measurement.

We enable on-disk storage for the original vectors and apply binary quantization, which compresses the vectors to reduce memory usage and speed up search. The always_ram parameter ensures that the quantized vectors are kept in RAM for faster access.

if not client.collection_exists(collection_name=collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1024,
                                           distance=models.Distance.COSINE,
                                           on_disk=True),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(
                always_ram=True,
            ),
        ),
    )
else:
    print("Collection already exists")

Step 4: Index the Documents

The indexing process uploads our processed documents and their embeddings to Qdrant in batches. Each document is stored alongside its vector representation, creating a searchable knowledge base.

The GPU will be used at this stage, and depending on the data size, this step may take a few minutes.

for idx in range(0, len(texts), BATCH_SIZE):
    docs = texts[idx:idx + BATCH_SIZE]
    embeds = embeddings[idx:idx + BATCH_SIZE]

    client.upload_collection(collection_name=collection_name,
                             vectors=embeds,
                             payload=[{"context": context} for context in docs])

# let the optimizer build the index once the bulk upload is complete
client.update_collection(collection_name=collection_name,
                         optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000))

Step 5: RAG Pipeline with Deepseek R1

Process 1: R - Retrieve Relevant Documents

The search function takes a user query, converts it into an embedding, and retrieves the most relevant documents from Qdrant based on cosine similarity. We demonstrate this with a sample query about the Bhagavad-gītā, showing how to access and print the retrieved context.

def search(query, k=5):
    # query = user prompt
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

relevant_docs = search("In Bhagavad-gītā who is the person devoted to?")

print(relevant_docs.points[4].payload['context'])
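The call above searches the quantized index with Qdrant’s default settings. To make explicit use of the oversampling and re-ranking behaviour described in the binary quantization section, you can pass quantization search parameters to query_points. The snippet below is a sketch with illustrative values; the oversampling factor of 2.0 is an assumption, not part of the original workflow.

result = client.query_points(
    collection_name=collection_name,
    query=embed_model.get_query_embedding("In Bhagavad-gītā who is the person devoted to?"),
    limit=5,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # search using the quantized vectors
            rescore=True,      # re-rank candidates with the original float vectors
            oversampling=2.0,  # retrieve 2x the limit before re-ranking
        )
    ),
)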

Process 2: A - Augment the Prompt

For RAG, it is important to define the system’s interaction template using ChatPromptTemplate. The template creates a specialized assistant knowledgeable in the Bhagavad-gita, capable of understanding multiple languages (English, Hindi, Sanskrit).

It includes structured formatting for context injection and query handling, with clear instructions for dealing with out-of-context questions.

from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole

message_templates = [
    ChatMessage(
        content="""
        You are an expert ancient assistant who is well versed in Bhagavad-gita.
        You are Multilingual, you understand English, Hindi and Sanskrit.
        
        Always structure your response in this format:
        <think>
        [Your step-by-step thinking process here]
        </think>
        
        [Your final answer here]
        """,
        role=MessageRole.SYSTEM),
    ChatMessage(
        content="""
        We have provided context information below.
        {context_str}
        ---------------------
        Given this information, please answer the question: {query}
        ---------------------
        If the question is not from the provided context, say `I don't know. Not enough information received.`
        """,
        role=MessageRole.USER,
    ),
]

Process 3: G - Generate the Response

The final pipeline brings everything together into a cohesive RAG system. It follows the Retrieve-Augment-Generate pattern: retrieving relevant documents, augmenting them with our specialized prompt template, and generating responses using the LLM. Here, for the LLM, we’ll use Deepseek R1 Distill Llama 70B hosted on Groq; get your keys here: Groq Console.

os.environ['GROQ_API_KEY'] = "GROQ_API_KEY"  # replace with your key
llm = Groq(model="deepseek-r1-distill-llama-70b")


def pipeline(query):
    # R - Retrieve
    relevant_documents = search(query)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response
    
    
print(pipeline("""what is the PURPORT of O my teacher, behold the great army of the sons of Pāṇḍu, so
expertly arranged by your intelligent disciple, the son of Drupada."""))

Output: (Syntax: <think> reasoning </think> response)

print(pipeline("""
Jayas tu pāṇḍu-putrāṇāṁ yeṣāṁ pakṣe janārdanaḥ.
explain this gita verse from its translation
"""))

Now, what if you need to use this application again? Are we supposed to go through all the steps again?

The answer is no.

Step 6: Inference on the Stored Index

There is not much difference from what you have already written. We’ll reuse the same search and pipeline functions, along with the collection name, to run query_points.

client = qdrant_client.QdrantClient(
    url="QDRANT_URL",
    api_key="QDRANT_API_KEY",
    prefer_grpc=True
)
    

# the search and pipeline code remain the same

def search(query, client, embed_model, k=5):
    collection_name = "bhagavad-gita"
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

def pipeline(query, embed_model, llm, client):
    # R - Retrieve
    relevant_documents = search(query, client, embed_model)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response

We’ll use the same two functions above, along with message_templates, in the Streamlit app.py.

Step 7: Streamlit UI

In Streamlit, the state is refreshed after every user question. To avoid reloading the entire page each time, we’ll define a few initialization steps under Streamlit’s cache_resource.

Remember, when the user enters a question, FastEmbed will download the model weights just once; the same goes for the Groq and Qdrant instantiation.

import streamlit as st
from time import sleep
import qdrant_client
from qdrant_client import models
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq
from dotenv import load_dotenv
import os

load_dotenv()

@st.cache_resource
def initialize_models():
    embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
    llm = Groq(model="deepseek-r1-distill-llama-70b")
    client = qdrant_client.QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
        prefer_grpc=True
    )
    return embed_model, llm, client
    
st.title("🕉️ Bhagavad Gita Assistant")
# this will run only once and be stored in the cache
embed_model, llm, client = initialize_models()

If you noticed the response output, the format is <think> reasoning </think> response.

In the UI, I want to keep the reasoning under a Streamlit expander. To retrieve the reasoning part, let’s use string indexing to extract the reasoning and the actual response.

def extract_thinking_and_answer(response_text):
    """Extract the thinking process and final answer from the response"""
    try:
        thinking = response_text[response_text.find("<think>") + 7:response_text.find("</think>")].strip()
        answer = response_text[response_text.find("</think>") + 8:].strip()
        return thinking, answer
    except:
        return "", response_text

Chatbot Component

This initializes a message history in Streamlit’s session state. A “Clear Chat” button in the sidebar allows users to reset this history.

It iterates through stored messages and displays them in a chat-like interface. For assistant responses, it separates the thinking process (shown in an expandable section) from the actual answer using the extract_thinking_and_answer function.

The remaining code is a standard layout for defining the chatbot component in Streamlit, i.e., input handling that creates an input field for user questions. When a question is submitted, it is displayed and added to the message history. The user’s question is then processed through the RAG pipeline while a loading spinner is shown, and the response is split into its thinking-process and answer parts.

def main():
    if "messages" not in st.session_state:
        st.session_state.messages = []

    with st.sidebar:
        if st.button("Clear Chat"):
            st.session_state.messages = []
            st.rerun()

    # Display chat messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            if message["role"] == "assistant":
                thinking, answer = extract_thinking_and_answer(message["content"])
                with st.expander("Show thinking process"):
                    st.markdown(thinking)
                st.markdown(answer)
            else:
                st.markdown(message["content"])

    # Chat input
    if prompt := st.chat_input("Ask your question about the Bhagavad Gita..."):
        # Display user message
        st.chat_message("user").markdown(prompt)
        st.session_state.messages.append({"role": "user", "content": prompt})

        # Generate and display response
        with st.chat_message("assistant"):
            message_placeholder = st.empty()
            with st.spinner("Thinking..."):
                full_response = pipeline(prompt, embed_model, llm, client)
                thinking, answer = extract_thinking_and_answer(full_response.text)

                with st.expander("Show thinking process"):
                    st.markdown(thinking)

                response = ""
                for chunk in answer.split():
                    response += chunk + " "
                    message_placeholder.markdown(response + "▌")
                    sleep(0.05)

                message_placeholder.markdown(answer)

        # Add assistant response to history
        st.session_state.messages.append({"role": "assistant", "content": full_response.text})

if __name__ == "__main__":
    main()
  • You’ll find the full code here.
  • Alternative Bhagavad Gita PDF: Download
  • Replace the “<replace-api-key>” placeholder with your keys.

Conclusion

By combining Deepseek R1’s reasoning, Qdrant’s binary quantization, and LlamaIndex’s RAG pipeline, we’ve built an AI assistant that delivers sub-2-second responses over 1,000+ pages. This project underscores how domain-specific LLMs and optimized vector databases can democratize access to ancient texts while maintaining cost efficiency. As open-source models continue to evolve, the possibilities for niche AI applications are limitless.

Key Takeaways

  • Deepseek R1 rivals OpenAI o1 in reasoning at 1/27th the cost, ideal for domain-specific tasks like scripture analysis, while OpenAI suits broader knowledge needs.
  • Understanding RAG pipeline implementation, with demonstrated code examples for document processing, embedding generation, and vector storage using LlamaIndex and Qdrant.
  • Efficient vector storage optimization through binary quantization in Qdrant, enabling the processing of large document collections while maintaining performance and accuracy.
  • Structured prompt engineering with clear templates for handling multilingual queries (English, Hindi, Sanskrit) and managing out-of-context questions effectively.
  • An interactive UI built with Streamlit to run inference on the application once the data is stored in the vector database.

Frequently Asked Questions

Q1. Does binary quantization reduce answer quality?

A. Minimal impact on recall! Qdrant’s oversampling re-ranks top candidates using the original vectors, maintaining accuracy while boosting speed 40x and slashing memory usage by 97%.

Q2. Can FastEmbed handle non-English texts like Sanskrit/Hindi?

A. Yes! The RAG pipeline uses FastEmbed’s embeddings and Deepseek R1’s language flexibility. Custom prompts guide responses in English, Hindi, or Sanskrit. While you could use an embedding model that understands Hindi tokens, in our case the model used understands English and Hindi text.

Q3. Why choose Deepseek R1 over OpenAI o1?

A. Deepseek R1 offers 27x lower API costs, comparable reasoning accuracy (20.5% vs o1’s 21%), and strong coding/domain-specific performance. It’s ideal for specialized tasks like scripture analysis, where cost and focused expertise matter.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

