
Contextual Retrieval for Multimodal RAG on Slide Decks


Imagine a world where finding information in a document is as easy as asking a question, and getting a response that combines both text and images seamlessly. In this guide, we dive into building a Multimodal Retrieval-Augmented Generation (RAG) pipeline that can do just that. You'll learn how to parse text and images from a PDF slide deck using tools like LlamaParse, create contextual summaries for enhanced retrieval, and feed this data into advanced models like GPT-4 for query answering. Along the way, we'll explore how contextual retrieval improves accuracy, optimize costs with prompt caching, and compare results between baseline and enhanced pipelines. Get ready to unlock the potential of RAG with this step-by-step walkthrough!

Learning Objectives

  • Understand how to parse PDF slide decks for text and images using LlamaParse.
  • Learn to add contextual summaries to text chunks for improved retrieval accuracy.
  • Build a Multimodal RAG pipeline combining text and images with LlamaIndex.
  • Explore the integration of multimodal data into models like GPT-4.
  • Compare retrieval performance between baseline and contextual indices.

This article was published as a part of the Data Science Blogathon.

Building a Contextual Multimodal RAG Pipeline

Contextual retrieval was originally introduced in this Anthropic blog post. The high-level intuition is that every chunk is given a concise summary of where it fits with respect to the overall document. This lets high-level concepts and keywords be attached to the chunk so it can be retrieved more reliably for different kinds of queries.

These LLM calls are expensive. Contextual retrieval depends on prompt caching in order to be efficient.

In this notebook, we use Claude 3.5 Sonnet to generate contextual summaries. We cache the document as text tokens, but generate contextual summaries by feeding in each parsed text chunk.

We feed both the text and image chunks into the final multimodal RAG pipeline to generate the response.

In a Retrieval-Augmented Generation (RAG) pipeline, we typically:

  • Parse our source data (e.g. PDF documents, images, slides).
  • Embed and index chunks of text for retrieval.
  • Retrieve relevant chunks for a given query.
  • Synthesize a response by feeding the retrieved chunks (and, optionally, any relevant images or additional metadata) into a Large Language Model (LLM), as sketched in the snippet below.
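
For reference, here is a minimal text-only version of that pipeline in LlamaIndex. The directory name and query are placeholders, just to show the shape of the standard flow before we make it multimodal and contextual:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Parse + chunk + embed + index the source files, then retrieve and synthesize in one call.
documents = SimpleDirectoryReader("data").load_data()
basic_index = VectorStoreIndex.from_documents(documents)
print(basic_index.as_query_engine().query("What is this document about?"))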

Contextual Retrieval is a neat enhancement to standard RAG. Each chunk of text is annotated with a short summary that situates it within the broader document context. This helps the retriever select the chunk more accurately for queries that might not match the exact words but relate to the overall topic or concept.
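
To make the idea concrete, here is a purely illustrative (made-up) example of what an annotated chunk might look like once its contextual summary is stored alongside it:

# Illustrative only: the chunk text and summary below are invented for this example.
annotated_chunk = {
    "text": "Cloud: 56%, Hybrid: 42%, On-prem: 2%",
    "context": (
        "This chunk is from the infrastructure section of a State of AI report "
        "and describes where enterprises deploy generative AI workloads."
    ),
}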

Overview of the Multimodal RAG Pipeline

We'll demonstrate how to build a Multimodal RAG pipeline over a PDF slide deck, using:

  • Anthropic as our main LLM (Claude 3.5 Sonnet).
  • VoyageAI embeddings for chunk embedding.
  • LlamaIndex for our retrieval/indexing abstractions.
  • LlamaParse for extracting text and images from the PDF slides.
  • An OpenAI GPT-4-style multimodal model for final query answering (in text+image mode).

We will also show how to cache LLM calls to minimize costs, since Contextual Retrieval can generate a lot of prompt calls.

Environment Setup and Dependencies

You'll need to install or upgrade a few packages:

!pip install -U llama-index llama-parse
!pip install -U llama-index-callbacks-arize-phoenix

Additionally:

  • Anthropic API Key: Set os.environ["ANTHROPIC_API_KEY"] to your Anthropic key.
  • VoyageAI API Key: Set os.environ["VOYAGE_API_KEY"] to your Voyage key.

Set Up Observability with LlamaTrace (Arize Integration)

We set up an integration with LlamaTrace (an integration with Arize Phoenix).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the PHOENIX_API_KEY variable below.

Voyage AI uses API keys to monitor usage and manage permissions. To obtain your key, sign in to your Voyage AI account and click the "Create new API key" button in the dashboard. You will need to add payment details as well, but your first 200 million tokens are still free for Voyage series 3 models.

The Phoenix API key can be obtained by signing up for LlamaTrace here, then navigating to the bottom-left panel and clicking on 'Keys', where you should find your API key.

import os
import nest_asyncio

nest_asyncio.apply()

# Arize Phoenix
PHOENIX_API_KEY = "<PHOENIX_API_KEY>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
import llama_index.core
llama_index.core.set_global_handler(
    "arize_phoenix",
    endpoint="https://llamatrace.com/v1/traces"
)

Load and Parse the PDF Slides

In our example, we'll parse the ICONIQ 2024 State of AI Report. This PDF is publicly available at the URL below. If you prefer, you can substitute it with any PDF you have.

!mkdir data
!mkdir data_images_iconiq
!wget "https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf" -O data/iconiq_report.pdf

Model Setup

Let's set up the core components required to build and run our Multimodal RAG pipeline.

import os
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.voyageai import VoyageEmbedding
from llama_index.core import Settings

# Replace with your actual keys
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
os.environ["VOYAGE_API_KEY"] = "..."

llm = Anthropic(model="claude-3-5-sonnet-20240620")
embed_model = VoyageEmbedding(model_name="voyage-3")

Settings.llm = llm
Settings.embed_model = embed_model

Parse Text and Images with LlamaParse

In this example, we use LlamaParse to parse both the text and images from the document.

We parse out the text with LlamaParse premium mode.

NOTE: The report has 40 pages, and at ~5c per page, parsing will cost about $2. Just a heads up!

To obtain a LlamaCloud API key, click 'Get started' here: https://www.llamaindex.ai/contact and log in. Once redirected to the LlamaCloud dashboard, generate a new API key by navigating to the API pane on the left.

from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    premium_mode=True,
    # invalidate_cache=True,  # Uncomment if you want to force a fresh parse
    api_key="LlamaCloud-API-Key",  # replace with your LlamaCloud API key
)
print("Parsing textual content...")
md_json_objs = parser.get_json_result("information/iconiq_report.pdf")
md_json_list = md_json_objs[0]["pages"]

image_dicts = parser.get_images(md_json_objs, download_path="data_images_iconiq")
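
As a quick sanity check (assuming the JSON structure used above, where each page dict carries an "md" key, as the node-building code below relies on), you can inspect what LlamaParse returned:

# Peek at the parse results: number of pages, first page's markdown, first image record.
print(f"Parsed {len(md_json_list)} pages")
print(md_json_list[0]["md"][:500])
print(image_dicts[0])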

Build Multimodal Nodes

Multimodal nodes are the building blocks that let us process and integrate different data types like text and images. Here, we'll construct nodes to parse, embed, and index chunks from the PDF slide deck, setting the foundation for a robust retrieval system.

Each PDF page corresponds to one "node" containing:

  • Text (parsed into Markdown)
  • Image (screenshot of that page)

Split Pages into Text Nodes

In this step, we'll split the PDF pages into smaller, manageable text nodes. This ensures efficient embedding and retrieval by breaking the content down into meaningful chunks for precise contextual analysis.

from pathlib import Path
from llama_index.core.schema import TextNode
from typing import Optional
import re

def get_page_number(file_name):
    match = re.search(r"-page_(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0

def _get_sorted_image_files(image_dir):
    raw_files = [
        f for f in list(Path(image_dir).iterdir()) if f.is_file() and "-page" in str(f)
    ]
    return sorted(raw_files, key=get_page_number)

def get_text_nodes(image_dir, json_dicts):
    nodes = []
    image_files = _get_sorted_image_files(image_dir)
    md_texts = [d["md"] for d in json_dicts]

    for idx, md_text in enumerate(md_texts):
        chunk_metadata = {
            "page_num": idx + 1,
            "image_path": str(image_files[idx]),
            "parsed_text_markdown": md_text,
        }
        node = TextNode(text="", metadata=chunk_metadata)
        nodes.append(node)

    return nodes

text_nodes = get_text_nodes("data_images_iconiq", md_json_list)
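
Before adding summaries, it's worth confirming that the page images and parsed markdown lined up as expected (the field names are the ones we set above):

# Inspect the first node's metadata: page number, image path, and a preview of the markdown.
print(text_nodes[0].metadata["page_num"])
print(text_nodes[0].metadata["image_path"])
print(text_nodes[0].metadata["parsed_text_markdown"][:300])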

Add Contextual Summaries

Contextual retrieval attaches a short, high-level summary to each chunk, describing where it fits into the overall document. We'll use the LLM to generate these short summaries and store them in each node's metadata["context"].

from copy import deepcopy
from llama_index.core.llms import ChatMessage
from llama_index.core.prompts import ChatPromptTemplate
import time


whole_doc_text = """
Here is the entire document.
<doc>
{WHOLE_DOCUMENT}
</doc>"""

chunk_text = """
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for
the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""


def create_contextual_nodes(nodes, llm):
    """Create contextual nodes for a list of nodes."""
    nodes_modified = []

    # Get the overall doc_text string
    doc_text = "\n".join([n.get_content(metadata_mode="all") for n in nodes])

    for idx, node in enumerate(nodes):
        start_time = time.time()
        new_node = deepcopy(node)

        # Combine whole_doc_text and chunk_text into a single string
        user_content = (
            f"{whole_doc_text.format(WHOLE_DOCUMENT=doc_text)}\n\n"
            f"{chunk_text.format(CHUNK_CONTENT=node.get_content(metadata_mode='all'))}"
        )

        messages = [
            ChatMessage(role="system", content="You are a helpful AI Assistant."),
            ChatMessage(role="user", content=user_content),
        ]

        # Send messages to the LLM and store the response in the node's metadata
        new_response = llm.chat(messages)
        new_node.metadata["context"] = str(new_response)

        nodes_modified.append(new_node)
        print(f"Completed node {idx}, {time.time() - start_time}")

    return nodes_modified
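
With the function defined, we run it over the parsed text nodes to produce the contextual nodes used by the index below. This makes one LLM call per page, so expect it to take a few minutes:

new_text_nodes = create_contextual_nodes(text_nodes, llm)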

Tip: To keep this step affordable, you can pass an extra_headers parameter carrying Anthropic's (date-versioned) prompt-caching beta header so the large document prompt is cached across chunk calls. This is just to illustrate how you might pass custom headers for Anthropic caching; actual usage can vary. A rough sketch follows.
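
Below is a minimal sketch of that caching idea using the raw anthropic SDK directly rather than the LlamaIndex wrapper. The beta header name and the cache_control block follow Anthropic's prompt-caching documentation at the time of writing, so treat this as an assumption and check the current docs before relying on it:

# Sketch only: cache the full document once, then reuse it for every chunk summary call.
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize_chunk_with_cached_doc(doc_text: str, chunk_content: str) -> str:
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {
                "type": "text",
                "text": whole_doc_text.format(WHOLE_DOCUMENT=doc_text),
                # The large document prompt is cached and reused across calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": chunk_text.format(CHUNK_CONTENT=chunk_content)}
        ],
    )
    return response.content[0].text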

Build and Persist the Index

We'll now embed these summarized chunks and store them in a vector store for retrieval. LlamaIndex can persist indices locally or integrate with 40+ external vector databases.

import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

if not os.path.exists("storage_nodes_iconiq"):
    index = VectorStoreIndex(new_text_nodes, embed_model=embed_model)
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes_iconiq")
else:
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes_iconiq")
    index = load_index_from_storage(storage_context, index_id="vector_index")

retriever = index.as_retriever()
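
A quick retrieval check (the query string is just an example) confirms that the contextual index returns sensible pages:

# Retrieve the top chunks for a sample query and print their page numbers and scores.
for nws in retriever.retrieve("Which departments use GenAI the most?"):
    print(nws.node.metadata["page_num"], nws.score)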

Baseline Index (Without Summaries)

We'll also build a "baseline" index on the original text nodes (without the contextual summaries) to compare the difference in retrieval quality.

if not os.path.exists("storage_nodes_iconiq_base"):
    base_index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    base_index.set_index_id("vector_index")
    base_index.storage_context.persist("./storage_nodes_iconiq_base")
else:
    storage_context = StorageContext.from_defaults(
        persist_dir="storage_nodes_iconiq_base"
    )
    base_index = load_index_from_storage(storage_context, index_id="vector_index")
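
With both indices available, you can compare which pages each one retrieves for the same query before running the full multimodal pipeline (the query below is illustrative):

# Compare the page numbers retrieved by the contextual and baseline indices.
test_query = "deep dive on infrastructure"
contextual_pages = [
    nws.node.metadata["page_num"]
    for nws in index.as_retriever(similarity_top_k=3).retrieve(test_query)
]
baseline_pages = [
    nws.node.metadata["page_num"]
    for nws in base_index.as_retriever(similarity_top_k=3).retrieve(test_query)
]
print("Contextual index pages:", contextual_pages)
print("Baseline index pages:", baseline_pages)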

Build a Multimodal Query Engine

We want a RAG pipeline that:

  • Retrieves relevant chunks of text.
  • Also loads the corresponding page images.
  • Sends both the text chunks and images to a multimodal LLM (here we illustrate using an OpenAI GPT-4o-style multimodal endpoint).

import base64
import openai
import os
from typing import Optional, List

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.base.response.schema import Response
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import NodeWithScore, MetadataMode

QA_PROMPT_TMPL = """
Below we give parsed text from slides, as well as images.

---------------------
{context_str}
---------------------

Given the context information and no prior knowledge, please answer the query:

Query: {query_str}
Answer:
"""

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)

def encode_image(image_path: str) -> str:
    """Encode a local image file as base64 so it can be inlined in the request."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


class MultimodalQueryEngine(CustomQueryEngine):
    """
    Custom multimodal query engine that retrieves text nodes,
    then sends them + the corresponding page image(s) to a vision-capable chat API.
    """

    retriever: BaseRetriever
    model_name: str
    qa_prompt: PromptTemplate

    def __init__(
        self,
        retriever: BaseRetriever,
        model_name: str = "gpt-4o",
        qa_prompt: Optional[PromptTemplate] = None,
    ) -> None:
        super().__init__(
            retriever=retriever,
            model_name=model_name,
            qa_prompt=qa_prompt or QA_PROMPT,
        )

    def custom_query(self, query_str: str) -> Response:
        # 1) Retrieve text nodes
        node_with_scores: List[NodeWithScore] = self.retriever.retrieve(query_str)

        # 2) Build the context string
        context_str = "\n\n".join(
            [nws.node.get_content(metadata_mode=MetadataMode.LLM) for nws in node_with_scores]
        )

        # 3) Format the final prompt
        formatted_prompt_text = self.qa_prompt.format(
            context_str=context_str,
            query_str=query_str,
        )

        # 4) Build the user message with text + images
        user_message_content = [
            {
                "type": "text",
                "text": formatted_prompt_text,
            }
        ]

        for nws in node_with_scores:
            image_path = nws.node.metadata.get("image_path", "")
            if image_path:
                base64_data = encode_image(image_path)
                image_url = f"data:image/jpeg;base64,{base64_data}"
                user_message_content.append(
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                            "detail": "auto",
                        },
                    }
                )

        messages = [
            {
                "role": "user",
                "content": user_message_content,
            }
        ]

        # 5) Call the vision-capable model (requires OPENAI_API_KEY in the environment)
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            max_tokens=500,
        )

        # 6) Return a Response object
        return Response(
            response=response.choices[0].message.content,
            source_nodes=node_with_scores,
            metadata={},
        )
        
# Create a query engine over the contextual index
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=3),
    model_name="gpt-4o",   # or "gpt-4o-mini", "gpt-4-turbo", etc.
)

base_query_engine = MultimodalQueryEngine(
    retriever=base_index.as_retriever(similarity_top_k=3),
    model_name="gpt-4o",
)

Trying Out Queries

Let's query our new pipeline about AI usage by department.

response = query_engine.query(
    "Which departments use GenAI the most and how are they using it?"
)
print(str(response))

A typical response might look like this:

Based on the parsed markdown text provided, the departments/teams that use
generative AI the most are:

1. **AI, Machine Learning, and Data Science** with a score of 4.5.
2. **IT** with a score of 4.0.
3. **Engineering / R&D** with a score of 3.9.

These scores are derived from a survey where respondents rated the level of
generative AI usage on a scale of 1-5.

In terms of how these departments are using generative AI:

- **AI, Machine Learning, and Data Science**: While specific use cases for this
department are not detailed in the provided text, it can be inferred that they are
likely using generative AI for advanced data analysis, model development, and
enhancing AI capabilities within the organization.

- **IT**: The IT department is using generative AI for several impactful use cases,
including:
- Ticket management
- Chatbots
- Customer support and troubleshooting
- Knowledge management
- Case summarization

The information about the departments and their use cases comes from the parsed
markdown text. There are no discrepancies between the parsed markdown and the
context provided, as the markdown text clearly outlines both the departments with
the highest usage scores and the specific use cases for the IT department.

Comparatively, if we run the same query on the baseline index:

base_response = base_query_engine.query(
    "Which departments use GenAI the most and how are they using it?"
)
print(str(base_response))

You'll see the baseline might have fewer details or slightly different retrieval results. Contextual retrieval provides more precise context around the IT usage specifically. The response would look like:

Based on the parsed markdown text provided, the departments that use Generative AI
(GenAI) the most are:

1. **AI, Machine Learning, and Data Science** - This department has the highest
weighted average score of 4.5 for GenAI usage, indicating significant adoption. The
specific use cases are not detailed in the parsed text, but given the nature of the
department, it is likely involved in developing and refining AI models and
algorithms.

2. **IT** - With a score of 4.0, the IT department is also a leading user of GenAI.
The use cases for IT include internal productivity improvements and IT operations,
as indicated by the 61% adoption rate for internal productivity and the 42% ROI
mention in IT use cases.

3. **Engineering / R&D** - This department has a score of 3.9. While specific use
cases are not detailed in the parsed text, it is reasonable to infer that GenAI is
used for product development and research purposes, as suggested by the 69%
adoption rate for core product performance improvements and 50% for natural language
interfaces.

The information is derived from the parsed markdown text, which provides a detailed
breakdown of GenAI usage by department and specific use cases. There are no
discrepancies between the parsed markdown and the raw text, as the markdown appears
to be a structured representation of the same data. The image was not provided, so
it was not used in forming the answer.

Observing the Benefits of Contextual Retrieval

Here's another example query. In this next question, the same sources are retrieved with and without contextual retrieval, and the answer is correct for both approaches, thanks to LlamaParse Premium's ability to understand graphs.

query = "what are relevant insights from the 'deep dive on infrastructure' section in terms of model preferences, cost, deployment environments?"

response = query_engine.query(query)
print(str(response))

Output

The "Deep Dive on Infrastructure" part from the ICONIQ Progress report offers
insights into the infrastructure points crucial for deploying AI options.
Nevertheless, the parsed markdown textual content doesn't explicitly point out mannequin preferences or
prices on this part. As a substitute, it focuses on infrastructure tooling and deployment
environments.

From the parsed markdown textual content, we are able to collect the next insights associated to
deployment environments:

1. **Deployment Environments**: Enterprises are primarily internet hosting generative AI
workloads on the cloud or utilizing a hybrid strategy. The popular deployment strategies
are:
- Cloud: 56%
- Hybrid: 42%
- On-prem: 2%

2. **Cloud Service Suppliers**: Probably the most utilized cloud service suppliers for
internet hosting AI workloads are:
- Amazon Internet Companies (AWS): 68%
- Microsoft Azure: 61%
- Google Cloud (GCP): 40%

These insights are derived from the parsed markdown textual content, particularly from the
sections discussing "Cloud Deployment Methodology" and "Infrastructure Tooling." There's
no point out of mannequin preferences or value issues within the supplied textual content. If
there have been any discrepancies or further particulars within the picture or uncooked textual content, they
are usually not out there right here, so the reply is predicated solely on the parsed markdown textual content
supplied.

Now, let's try the baseline approach:

base_response = base_query_engine.query(query)
print(str(base_response))

Output

The parsed text from the slides does not provide specific insights regarding model preferences, cost, or deployment environments in the 'deep dive on infrastructure' section. The slide titled "Deep Dive on Infrastructure" (page 24) only contains the title, the ICONIQ Growth branding, and confidentiality and copyright notices. There is no detailed information or data presented in the parsed text for this section.

Therefore, based on the parsed markdown text provided, there are no relevant insights available from the 'deep dive on infrastructure' section regarding model preferences, cost, or deployment environments. If there were any images associated with this section, they were not provided, and thus no additional insights could be derived from them.

This conclusion is drawn from the parsed markdown text, which lacks any specific information on model preferences, cost, or deployment environments in that section. The image confirms this, as it only shows the title and a graphic without additional details.

If you need insights on these topics, you might want to refer to other sections or slides that specifically address model preferences, costs, or deployment environments.

  • Contextual Retrieval may fetch the pages that discuss cloud deployment methods, infrastructure tooling, and cost references, leading to a more thorough response.
  • The baseline approach might (in some cases) fail to retrieve the correct chunk or provide less detail.

Comparing both answers helps show that these short "contextual summaries" in your metadata often lead to more relevant retrieval.

A big thanks to Jerry Liu from LlamaIndex for creating this excellent pipeline.

Conclusion

In this tutorial, we explored the process of parsing a PDF slide deck using LlamaParse to extract both text and images, enriching each text chunk with contextual summaries to enhance retrieval accuracy. We demonstrated how to build a Multimodal RAG pipeline with LlamaIndex, integrating both textual and visual data into a powerful model like GPT-4, showcasing the potential of multimodal LLMs. Finally, we compared results from a baseline index to a contextual index, highlighting the improvements in retrieval precision and relevance achieved through the contextual approach. This guide equips you with the tools and techniques to build effective multimodal AI solutions.

Key Takeaways

  • Contextual retrieval improves chunk matching for queries that might not have a direct keyword overlap.
  • Multimodal RAG can incorporate not just text but also images, charts, or diagrams from slides.
  • Prompt caching is essential when chunk sizes are large and you're generating a context summary for each chunk; it can reduce cost significantly.
  • If you have web-based content (like store listings or large sets of HTML pages), you can use ScrapeGraphAI to fetch that data, then feed it into the same pipeline.

With these steps, you can adapt the approach to any PDF or external data source, whether it's a huge enterprise knowledge base, marketing materials, or your company's internal documentation.

Frequently Asked Questions

Q1. What is "Contextual Retrieval" and why do I need it?

A. Contextual Retrieval is an approach where each chunk of text in your dataset has a concise summary that situates it within the broader document. This helps your retriever better match relevant chunks, especially for queries that rely on thematic or conceptual overlaps rather than exact keyword matches.

Q2. How does Multimodal RAG differ from standard RAG?

A. In a Multimodal RAG pipeline, you not only retrieve and feed text chunks into the LLM but also related images, audio, or other modalities. This is especially useful when your data sources are slide decks, PDFs with charts, or any materials that mix text with images. It allows the model to reference both textual and visual content for a more comprehensive answer.

Q3. Why do I need LlamaParse to parse PDF slides?

A. LlamaParse is a parsing utility that can extract both text and images from a PDF. Traditional PDF extractors often only get the text or struggle with embedded charts and diagrams. With LlamaParse, you can create "nodes" that include a reference to each PDF page's image file, enabling real multimodal retrieval.

Q4. Is it necessary to create a baseline index without contextual summaries?

A. No, it isn't necessary, but it's a great way to benchmark the difference. Having a baseline index helps you see how retrieval results change when you add contextual summaries.


Hi! I'm Adarsh, a Business Analytics graduate from ISB, currently deep into research and exploring new frontiers. I'm super passionate about data science, AI, and all the innovative ways they can transform industries. Whether it's building models, working on data pipelines, or diving into machine learning, I love experimenting with the latest tech. AI isn't just my interest, it's where I see the future heading, and I'm always excited to be a part of that journey!
