Published on Apr 21, 2023 by Vitor Sousa

RAG System: A Case Study in Local-First Document Q&A

Check the repo: RAG, Llama3, and Elasticsearch project repository


Problem Statement

The Challenge

Organizations accumulate vast amounts of unstructured text data—customer support logs, internal documentation, conversation transcripts—but extracting actionable insights from these documents remains painful. Traditional keyword search fails to understand semantic meaning, returning irrelevant results when users phrase questions differently than the source text.

I wanted to explore how modern LLMs could transform document retrieval. The goal was simple: build a system where users could ask natural language questions about their documents and receive accurate, contextual answers—all running locally without sending data to external APIs.

Why This Matters

The privacy implications of sending proprietary documents to cloud LLM providers concern many organizations. Customer conversations, internal processes, and business-sensitive information shouldn’t leave controlled environments. A local-first RAG system addresses this by keeping both the data and the inference entirely on-premise.

Beyond privacy, I wanted to understand the RAG pipeline deeply. Frameworks like LangChain and LlamaIndex abstract away complexity, but understanding what happens under the hood—chunking strategies, embedding generation, vector similarity search, prompt construction—is essential for debugging production systems and making informed architectural decisions.

Constraints

  • Fully local deployment: No external API calls for embeddings or generation
  • Reasonable hardware: Must run on consumer-grade machines (no GPU clusters)
  • JSON document support: Focus on structured conversation data initially
  • Interactive interface: Users should see results in real-time via a web UI

Technical Approach

System Architecture

The system follows a classic RAG pattern with three distinct phases: ingestion, retrieval, and generation.

flowchart TB
    subgraph Ingestion["📥 Document Ingestion"]
        A[JSON Documents] --> B[Document Parser]
        B --> C[Sentence Splitter<br/>chunk_size=350<br/>overlap=50]
        C --> D[Ollama Embedding<br/>Llama3]
        D --> E[(Elasticsearch<br/>Vector Store)]
    end
    subgraph Retrieval["🔍 Semantic Retrieval"]
        F[User Query] --> G[Query Embedding<br/>Llama3]
        G --> H{Elasticsearch<br/>kNN Search}
        E --> H
        H --> I[Top-K Documents<br/>k=10]
    end
    subgraph Generation["🤖 Response Generation"]
        I --> J[Context Assembly]
        F --> J
        J --> K[Llama3 LLM<br/>via Ollama]
        K --> L[Generated Answer]
    end
    subgraph Interface["💻 Streamlit UI"]
        M[File Upload] --> A
        F --> N[Query Input]
        L --> O[Response Display]
    end

Key Design Decisions

1. Elasticsearch Over Specialized Vector DBs

I chose Elasticsearch over purpose-built vector databases like Pinecone, Weaviate, or Qdrant for several reasons:

Consideration        | Elasticsearch                | Specialized Vector DBs
Hybrid search        | Native BM25 + vector         | Often vector-only
Operational maturity | Battle-tested at scale       | Newer, less proven
Local deployment     | Easy single-node Docker      | Varies by provider
Existing tooling     | Kibana, monitoring built-in  | Often limited

For a production system, I’d want the option to combine semantic search with traditional keyword matching. Elasticsearch’s dense vector fields with HNSW indexing provide competitive retrieval quality while maintaining flexibility.
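
For reference, the index mapping behind this might look like the sketch below. This is an illustrative configuration rather than the project's exact mapping: the field names mirror the ingestion code later in this post, and the 4096-dimension value assumes Llama3 embeddings (it must match whatever embedding model is used).

# Illustrative Elasticsearch 8.x mapping for the "calls" index (assumed, not the repo's exact config)
calls_mappings = {
    "properties": {
        "conversation": {"type": "text"},
        "conversation_vector": {
            "type": "dense_vector",
            "dims": 4096,            # Llama3 embedding size; must match the embedding model
            "index": True,           # enables approximate kNN (HNSW) search
            "similarity": "cosine",
        },
    }
}

# e.g. await es_client.indices.create(index="calls", mappings=calls_mappings)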

2. LlamaIndex as the Orchestration Layer

Rather than building retrieval pipelines from scratch, LlamaIndex provides:

  • Standardized abstractions: Document, Node, Index concepts map cleanly to RAG workflows
  • Pluggable components: Swap embedding models, vector stores, or LLMs without rewriting logic
  • Built-in chunking: SentenceSplitter handles text segmentation intelligently

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=350, chunk_overlap=50),
        OllamaEmbedding("llama3")
    ],
    vector_store=es_vector_store
)

The IngestionPipeline abstracts the transform-embed-store workflow into a single declarative configuration.
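
Running the pipeline is then a single call. A minimal sketch, assuming documents is a list of LlamaIndex Document objects:

nodes = pipeline.run(documents=documents)
print(f"Embedded and stored {len(nodes)} chunks in Elasticsearch")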

3. Ollama for Local Model Serving

Ollama dramatically simplifies running LLMs locally. A single command downloads and serves models:

ollama run llama3

This exposes a local API compatible with OpenAI’s format, making it trivial to integrate with LlamaIndex:

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Same model for embeddings and generation
local_llm = Ollama(model="llama3")
embed_model = OllamaEmbedding("llama3")
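
Because Ollama is just a local HTTP server, the same model can also be called directly, which is handy for debugging. A minimal sketch, assuming Ollama's default port and its /api/generate endpoint:

import requests

# Direct call to the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Summarize RAG in one sentence.", "stream": False},
)
print(resp.json()["response"])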

4. Chunking Strategy

The chunking configuration significantly impacts retrieval quality:

SentenceSplitter(chunk_size=350, chunk_overlap=50)

  • chunk_size=350: Small enough to maintain semantic coherence, large enough to preserve context
  • chunk_overlap=50: Prevents information loss at chunk boundaries

For conversation data, this typically yields 1-3 chunks per conversation, preserving the dialogue flow while enabling fine-grained retrieval.
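
A quick way to sanity-check this is to run the splitter over a sample conversation and count the resulting nodes. A sketch, where sample_conversation is a hypothetical transcript string:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=350, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents([Document(text=sample_conversation)])
print(f"{len(nodes)} chunks")  # typically 1-3 for a single conversation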

Implementation Deep Dive

Document Ingestion

The ingestion module transforms raw JSON into embedded vectors stored in Elasticsearch:

from elasticsearch import AsyncElasticsearch
from llama_index.vector_stores.elasticsearch import ElasticsearchStore
from llama_index.core import Document

# Elasticsearch connection
es_client = AsyncElasticsearch("http://localhost:9200")

es_vector_store = ElasticsearchStore(
    index_name="calls",
    vector_field='conversation_vector',
    text_field='conversation',
    es_client=es_client
)

def get_documents_from_json(json_data) -> list:
    """Convert JSON records to LlamaIndex Documents."""
    return [
        Document(
            text=item["conversation"],
            metadata={"conversation_id": item["conversation_id"]}
        )
        for item in json_data
    ]

def ingest_documents(documents):
    """Run the ingestion pipeline."""
    ollama_embedding = OllamaEmbedding("llama3")
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=350, chunk_overlap=50),
            ollama_embedding
        ],
        vector_store=es_vector_store
    )
    pipeline.run(documents=documents)
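
Wiring this together for a local file might look like the following sketch (conversations.json is a hypothetical filename matching the expected schema):

import json

# Load raw conversations and push them through the pipeline
with open("conversations.json") as f:
    json_data = json.load(f)

documents = get_documents_from_json(json_data)
ingest_documents(documents)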

Query Engine Setup

The query engine combines retrieval and generation:

from llama_index.core import VectorStoreIndex, QueryBundle, Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

def setup_query_engine(model_name="llama3"):
    """Configure the RAG query engine."""
    local_llm = Ollama(model=model_name)
    Settings.embed_model = OllamaEmbedding(model_name)

    # Build index from existing vector store
    index = VectorStoreIndex.from_vector_store(es_vector_store)

    # Return query engine with top-10 retrieval
    return index.as_query_engine(local_llm, similarity_top_k=10)

def execute_query(query_engine, query_text):
    """Execute a query and return the response."""
    bundle = QueryBundle(
        query_text,
        embedding=Settings.embed_model.get_query_embedding(query_text)
    )
    return query_engine.query(bundle)
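
Putting the two helpers together, a query looks like this (sketch):

query_engine = setup_query_engine("llama3")
response = execute_query(
    query_engine,
    "Give me a summary of Billing Inquiries and Corrections conversations."
)
print(response)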

Streamlit Interface

The UI handles document upload and query execution:

import streamlit as st

async def check_embeddings():
    """Verify if documents are already indexed."""
    try:
        index_exists = await es_client.indices.exists(index="calls")
        if index_exists:
            doc_count = await es_client.count(
                index="calls",
                body={"query": {"exists": {"field": "conversation_vector"}}}
            )
            return doc_count['count'] > 0
        return False
    except Exception as e:
        st.error(f"Elasticsearch connection failed: {e}")
        return False
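
The rest of the UI wires upload and querying around that check. A minimal sketch of the flow (illustrative only; the real app's layout and widgets differ):

import asyncio
import json
import streamlit as st

# Ingest only when nothing is indexed yet
uploaded = st.file_uploader("Upload conversations (JSON)", type="json")
if uploaded and not asyncio.run(check_embeddings()):
    ingest_documents(get_documents_from_json(json.load(uploaded)))

# Query box backed by the RAG query engine
query_text = st.text_input("Ask a question about your documents")
if query_text:
    engine = setup_query_engine()
    st.write(str(execute_query(engine, query_text)))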

Upload interface showing the JSON file ingestion flow


Results & Metrics

Qualitative Performance

The system successfully answers semantic queries over conversation data:

Query result showing the system answering a question about billing inquiries

Example query: “Give me a summary of Billing Inquiries and Corrections conversations.”

The system retrieves relevant conversations and synthesizes a coherent summary, demonstrating:

  • Semantic understanding (not just keyword matching)
  • Cross-document synthesis
  • Natural language response generation

Observed Characteristics

Metric              | Observation
Ingestion speed     | ~2-3 seconds per document (CPU-only, Llama3 8B)
Query latency       | 5-15 seconds end-to-end (retrieval + generation)
Memory usage        | ~8 GB for Ollama + Elasticsearch
Retrieval relevance | Good for topically distinct documents

Limitations Discovered

  1. Cold start latency: First query after Ollama restart takes longer as the model loads into memory
  2. Embedding speed: CPU-based embedding generation is the primary bottleneck
  3. Context window: With top_k=10 documents, complex queries can exceed the context window
  4. No re-ranking: Retrieved documents are used as-is without relevance re-ranking

What I’d Do Differently

Immediate Improvements

1. Add a Re-ranking Stage

Raw vector similarity isn’t always the best relevance signal. Adding a cross-encoder re-ranker after initial retrieval would improve answer quality:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, documents, top_k=5):
    pairs = [(query, doc.text) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

2. Implement Hybrid Search

Combining dense (vector) and sparse (BM25) retrieval often outperforms either alone:

# Elasticsearch supports hybrid queries natively
{
    "query": {
        "bool": {
            "should": [
                {"match": {"conversation": query_text}},  # BM25
                {"knn": {"conversation_vector": {...}}}   # Dense
            ]
        }
    }
}
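
A hedged sketch of the same idea with the Python client, assuming a recent Elasticsearch 8.x that supports the knn query clause (field names follow the ingestion config above; parameters would need tuning):

async def hybrid_search(query_text: str, size: int = 10):
    """Combine BM25 and approximate kNN scoring in one request (illustrative)."""
    query_vector = Settings.embed_model.get_query_embedding(query_text)
    return await es_client.search(
        index="calls",
        query={
            "bool": {
                "should": [
                    {"match": {"conversation": query_text}},  # sparse / BM25
                    {"knn": {
                        "field": "conversation_vector",
                        "query_vector": query_vector,
                        "num_candidates": 100,
                    }},                                       # dense / ANN
                ]
            }
        },
        size=size,
    )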

3. Add Streaming Responses

The current implementation waits for complete generation. Streaming tokens to the UI would dramatically improve perceived latency:

# LlamaIndex supports streaming when the query engine is built with streaming=True
streaming_engine = index.as_query_engine(llm=local_llm, streaming=True, similarity_top_k=10)
response = streaming_engine.query(bundle)
for token in response.response_gen:
    yield token

Architectural Changes

4. Separate Embedding and Generation Models

Using Llama3 for both embedding and generation isn’t optimal. Specialized embedding models (e.g., nomic-embed-text, bge-large) typically produce better retrieval results:

# Specialized embedding model
Settings.embed_model = OllamaEmbedding("nomic-embed-text")

# Larger model for generation
local_llm = Ollama(model="llama3:70b")

5. Implement Evaluation Pipeline

Without systematic evaluation, it’s impossible to measure improvements. I’d add:

  • Retrieval metrics: MRR@K, NDCG (a minimal MRR@K sketch follows this list)
  • Generation metrics: ROUGE, BERTScore against reference answers
  • End-to-end: Human evaluation on a held-out test set
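
As a starting point for the retrieval side, MRR@K needs nothing more than a labelled set of (query, relevant document) pairs. A minimal sketch with hypothetical IDs:

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the first relevant hit within the top-k results."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Hypothetical labelled queries: (retrieved ids in rank order, relevant id)
labelled = [(["c3", "c1", "c7"], "c1"), (["c2", "c9", "c4"], "c8")]
score = sum(mrr_at_k(r, rel) for r, rel in labelled) / len(labelled)
print(f"MRR@10 = {score:.2f}")  # 0.25 for this toy set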

6. Add Centroid-based Diverse Retrieval

For long-context scenarios, retrieving semantically diverse documents prevents redundancy:

from sklearn.cluster import KMeans
import numpy as np

def get_diverse_contexts(embeddings, k=5):
    """Cluster the embeddings and return the index of the document nearest each centroid."""
    kmeans = KMeans(n_clusters=k, init='k-means++')
    kmeans.fit(embeddings)
    # Return the documents closest to each cluster centre, not the raw centroids
    return [
        int(np.argmin(np.linalg.norm(embeddings - center, axis=1)))
        for center in kmeans.cluster_centers_
    ]

This addresses the “lost in the middle” problem where LLMs underweight information in the middle of long contexts.
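
Applied to retrieval, the idea would be to over-retrieve and then keep one chunk per cluster. A hypothetical usage on top of a LlamaIndex retriever (names follow the earlier snippets):

import numpy as np

# Over-retrieve, then keep one representative chunk per cluster (illustrative)
retriever = index.as_retriever(similarity_top_k=30)
candidates = retriever.retrieve(query_text)
embeddings = np.array([
    Settings.embed_model.get_text_embedding(node.node.get_content())
    for node in candidates
])
keep = get_diverse_contexts(embeddings, k=5)
diverse_nodes = [candidates[i] for i in keep]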

Production Considerations

7. Add Observability

Production RAG systems need visibility into:

  • Which documents were retrieved for each query (see the tracing sketch after this list)
  • Embedding latencies and cache hit rates
  • LLM token usage and generation times
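
Retrieved sources and end-to-end latency are cheap to capture by wrapping the query call. A minimal, illustrative sketch:

import logging
import time

logger = logging.getLogger("rag")

def query_with_tracing(query_engine, query_text):
    """Run a query while logging retrieved sources and latency (illustrative wrapper)."""
    start = time.perf_counter()
    response = query_engine.query(query_text)
    elapsed = time.perf_counter() - start
    for source in response.source_nodes:
        logger.info("retrieved node=%s score=%s", source.node.node_id, source.score)
    logger.info("query answered in %.2fs", elapsed)
    return response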

8. Implement Caching

Repeated queries should hit a cache rather than re-running the full pipeline:

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_query_embedding(query_text):
    return embed_model.get_query_embedding(query_text)

Key Takeaways

  1. Local-first RAG is viable: Ollama + Elasticsearch provides a fully private, self-hosted solution
  2. Chunking matters: Small chunks with overlap preserve context while enabling granular retrieval
  3. LlamaIndex accelerates development: The abstractions handle complexity while remaining customizable
  4. Evaluation is essential: Without metrics, optimization is guesswork
  5. Hybrid approaches win: Combining BM25 + dense retrieval + re-ranking typically outperforms any single method
