RAG Done Right: Why Most Retrieval Systems Disappoint (And How to Fix Yours)

I’ve reviewed dozens of RAG implementations over the past year. Most of them disappoint their users. The symptoms are always the same: the system retrieves documents that seem relevant but the LLM generates answers that miss the point, hallucinate details, or ignore information that’s sitting right there in the knowledge base.

The problem is rarely the LLM. It’s almost always the retrieval.

Here’s what I’ve learned about building RAG systems that actually work—the mistakes everyone makes and how to fix them.

What RAG Actually Is (30-Second Version)

Retrieval Augmented Generation is a pattern where you search a knowledge base for relevant information and include it in the prompt before asking the LLM to respond. Instead of relying on what the model learned during training, you give it specific context to work with.

flowchart LR
    A[📝 User Query] --> B[🔢 Embed Query]
    B --> C[🔍 Vector Search]
    C --> D[📄 Retrieved Docs]
    D --> E[🎯 Rerank]
    E --> F[📋 Build Prompt]
    F --> G[🤖 LLM]
    G --> H[✅ Answer]
    
    I[(Knowledge Base)] --> C

The basic flow: user asks a question, you convert that question to a vector embedding, search your vector database for similar content, retrieve the top results, optionally rerank them, stuff them into the prompt, and let the LLM generate an answer using that context.

Simple in concept. Surprisingly hard to get right in practice.

Why Most RAG Systems Fail

The failure modes are predictable because everyone makes the same mistakes.

Chunking without thinking is the first problem. Most tutorials tell you to split documents into 500-token chunks with 50-token overlap and call it done. This works terribly for most real content. A 500-token chunk might split a paragraph mid-sentence, separate a question from its answer, or cut a code example in half. The chunk becomes meaningless out of context, so when it’s retrieved, it doesn’t actually help.

Using the wrong embedding model is the second problem. People default to OpenAI’s text-embedding-ada-002 because it’s easy. It’s fine for general text, but if your content is technical, domain-specific, or in a language other than English, you might be leaving significant retrieval quality on the table. The embedding model determines what “similar” means—if it doesn’t understand your domain, similar content won’t look similar.

Retrieving too few or too many results is the third problem. Retrieve too few and you miss relevant information. Retrieve too many and you overwhelm the context window with marginally relevant content that dilutes the good stuff. Most people pick an arbitrary number like 5 or 10 and never tune it.

No reranking is the fourth problem. Vector similarity is a rough approximation of relevance. The top 20 results by embedding similarity might have 5 highly relevant chunks and 15 that are tangentially related at best. A reranker that looks at query-document pairs more carefully can dramatically improve precision.

Ignoring hybrid search is the fifth problem. Vector search is great for semantic similarity but terrible for exact matches. If the user asks about “error code 5012” and your knowledge base has a document titled “Error Code 5012 Resolution,” vector search might not rank it first because the semantic content is sparse. Keyword search would find it instantly.

Chunking Strategies That Work

The goal of chunking is to create pieces of text that are meaningful on their own and that will be useful when retrieved. Different content needs different strategies.

For prose documents like articles, documentation, or reports, chunk by semantic boundaries rather than fixed token counts. Split on paragraph breaks. Keep headers with their content. If a section is short, combine it with the next section rather than leaving orphan chunks that lack context.

Here’s a simple semantic chunker:

import re
from typing import List

def semantic_chunk(text: str, max_tokens: int = 500, min_tokens: int = 100) -> List[str]:
    """
    Chunk text by paragraphs, combining small paragraphs and splitting large ones.
    """
    # Split on double newlines (paragraph breaks)
    paragraphs = re.split(r'\n\n+', text)
    
    chunks = []
    current_chunk = []
    current_length = 0
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        para_tokens = len(para.split())  # Rough token estimate
        
        # If paragraph alone exceeds max, split it
        if para_tokens > max_tokens:
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = []
                current_length = 0
            
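            # Split large paragraph by sentences
            sentences = re.split(r'(?<=[.!?])\s+', para)
            for sentence in sentences:
                sent_tokens = len(sentence.split())
                if current_length + sent_tokens > max_tokens and current_chunk:
                    chunks.append(' '.join(current_chunk))
                    current_chunk = []
                    current_length = 0
                current_chunk.append(sentence)
                current_length += sent_tokens

            # Collapse leftover sentences into one piece so they are not later
            # joined sentence-by-sentence with paragraph breaks
            if current_chunk:
                current_chunk = [' '.join(current_chunk)]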
        
        # If adding this paragraph exceeds max, start new chunk
        elif current_length + para_tokens > max_tokens:
            if current_chunk:
                chunks.append('\n\n'.join(current_chunk))
            current_chunk = [para]
            current_length = para_tokens
        
        # Otherwise, add to current chunk
        else:
            current_chunk.append(para)
            current_length += para_tokens
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    
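    # Drop chunks below min_tokens. Note that this discards their text entirely,
    # including whole documents shorter than min_tokens; pass min_tokens=0 to keep everything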
    return [c for c in chunks if len(c.split()) >= min_tokens]

For structured documents like FAQs, Q&A pairs, or product specs, chunk by logical unit. Each FAQ entry should be one chunk. Each product should be one chunk. Don’t let structure get split across chunks.
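
As a rough illustration of chunking by logical unit, the sketch below produces one chunk per FAQ entry. It assumes a hypothetical plain-text layout where every question starts a line with `Q:`; the function name `chunk_faq` and that layout are illustrative, and real FAQ exports will need their own parsing.

```python
import re
from typing import List


def chunk_faq(text: str) -> List[str]:
    """Return one chunk per FAQ entry, keeping each question with its answer.

    Assumes (hypothetically) that every entry begins with a line starting "Q:".
    """
    # Split just before each line that starts with "Q:" so the marker stays
    # attached to the entry it introduces.
    entries = re.split(r'(?m)^(?=Q:)', text)
    return [entry.strip() for entry in entries if entry.strip()]
```

Any text before the first `Q:` (an intro paragraph, say) comes back as its own chunk, which you may want to drop or attach to the first entry.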

For code repositories, chunk by function or class, not by line count. Include docstrings and comments with the code they describe. Consider including the file path and import context so the chunk makes sense in isolation.
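
For Python repositories, one way to sketch this is with the standard library’s `ast` module. The function name `chunk_python_source` and the `# File:` header format are my own assumptions, not a standard; other languages would need a different parser.

```python
import ast
from typing import List


def chunk_python_source(source: str, file_path: str) -> List[str]:
    """Return one chunk per top-level function or class in a Python file."""
    tree = ast.parse(source)

    # Collect the file's top-level import statements so each chunk carries
    # that context with it.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join([f"# File: {file_path}", *imports])

    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the node's exact source text, which
            # includes its docstring and any comments inside the body.
            body = ast.get_source_segment(source, node)
            chunks.append(f"{header}\n\n{body}")
    return chunks
```

This sketch skips decorators and comments that sit above a definition, and it treats a class as a single chunk. Large classes may be better split per method.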

For conversational data like support tickets or chat logs, chunk by conversation turn or topic change, not by message count. A single exchange might span 20 short messages but represent one coherent topic that should stay together.
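
Detecting a topic change properly is hard, so the sketch below uses a crude stand-in: a long pause between messages. That heuristic, the `(timestamp, speaker, text)` tuple layout, and the 30-minute default are all assumptions for illustration, so calibrate or replace them for your own data.

```python
from typing import List, Optional, Tuple


def chunk_by_time_gap(
    messages: List[Tuple[float, str, str]],  # (unix_timestamp, speaker, text)
    max_gap_seconds: float = 1800.0,
) -> List[str]:
    """Group chronologically ordered messages, starting a new chunk after a long pause."""
    chunks: List[str] = []
    current: List[str] = []
    previous_ts: Optional[float] = None

    for ts, speaker, text in messages:
        # Treat a long silence as a (hypothetical) topic boundary.
        if previous_ts is not None and ts - previous_ts > max_gap_seconds and current:
            chunks.append("\n".join(current))
            current = []
        current.append(f"{speaker}: {text}")
        previous_ts = ts

    # Emit whatever is left in the buffer as the final chunk.
    if current:
        chunks.append("\n".join(current))
    return chunks
```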

The meta-lesson: there’s no universal chunking strategy. Look at your actual content, look at what queries users will ask, and design chunks that will be useful for those queries.

Choosing Embedding Models

The embedding model determines how your system understands similarity. Different models have different strengths.

OpenAI text-embedding-3-small is the current default choice. It’s cheap ($0.02 per million tokens), fast, and good enough for most English content. The larger text-embedding-3-large is better quality but more expensive.

Cohere embed-v3 is strong for multilingual content and has built-in support for different input types (documents vs. queries). Worth evaluating if you’re not English-only.

Open source options like BGE, E5, and GTE perform competitively on benchmarks and can run locally. If you have privacy requirements or want to avoid API costs at scale, these are viable. The MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard tracks current best performers.

For domain-specific content (legal, medical, scientific), you might benefit from embeddings trained or fine-tuned on that domain. General-purpose embeddings often fail to separate domain-specific senses cleanly: “patient” in a clinical note can land too close to “patient” meaning “willing to wait,” which hurts retrieval quality.

Here’s how to evaluate embedding models on your actual data:

from typing import List, Tuple
import numpy as np

def evaluate_embeddings(
    queries: List[str],
    relevant_docs: List[List[int]],  # For each query, indices into all_docs of the relevant docs
    all_docs: List[str],
    embed_fn,  # Function that takes list of strings, returns embeddings
    k: int = 10
) -> dict:
    """
    Evaluate retrieval quality for an embedding function.
    Returns precision@k, recall@k, and MRR.
    """
    # Embed all documents and L2-normalize so the dot product is cosine similarity
    doc_embeddings = np.asarray(embed_fn(all_docs), dtype=float)
    doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    
    precisions = []
    recalls = []
    mrrs = []
    
    for query, relevant in zip(queries, relevant_docs):
        # Embed and normalize the query
        query_embedding = np.asarray(embed_fn([query])[0], dtype=float)
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
        # Compute similarities and take the k best, highest first
        similarities = np.dot(doc_embeddings, query_embedding)
        top_k_indices = np.argsort(similarities)[-k:][::-1].tolist()
        
        # Calculate metrics (relevant holds document indices, like top_k_indices)
        retrieved_relevant = len(set(top_k_indices) & set(relevant))
        precisions.append(retrieved_relevant / k)
        recalls.append(retrieved_relevant / len(relevant) if relevant else 0)
        
        # MRR: reciprocal rank of first relevant result
        for rank, idx in enumerate(top_k_indices, 1):
            if idx in relevant:
                mrrs.append(1 / rank)
                break
        else:
            mrrs.append(0)
    
    return {
        f"precision@{k}": np.mean(precisions),
        f"recall@{k}": np.mean(recalls),
        "mrr": np.mean(mrrs)
    }

Build a test set of queries with known relevant documents, then compare embedding models on your actual content. The best model on benchmarks isn’t always the best model for your data.

Retrieval Tuning

Once you have chunks and embeddings, tuning the retrieval itself matters more than most people realize.

How many results to retrieve depends on your context window budget and the nature of your queries. Start with retrieving more than you think you need (say, 20), then use a reranker to select the best 5. This gives you better recall without stuffing irrelevant content into the prompt.
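
Rather than guessing the candidate count, you can measure it. The sketch below computes mean recall at several values of k over a labeled test set; the names `recall_at_k` and `sweep_k` and the input shapes are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def sweep_k(
    rankings: List[Sequence[str]],  # per query: retrieved doc IDs, best first
    relevant: List[Set[str]],       # per query: doc IDs judged relevant
    ks: Sequence[int] = (5, 10, 20, 50),
) -> Dict[int, float]:
    """Mean recall@k across all queries, for each candidate k."""
    return {
        k: sum(recall_at_k(r, rel, k) for r, rel in zip(rankings, relevant)) / len(rankings)
        for k in ks
    }
```

If recall stops improving past some k, fetching more candidates mostly adds reranking cost, so that plateau is a reasonable place to set the retrieval count.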

Similarity thresholds can filter out low-quality matches. If you retrieve 10 results but 7 of them have similarity scores below 0.7, maybe you should only include the top 3. No threshold is universal—you need to calibrate based on your embeddings and content.
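
A minimal sketch of threshold filtering, assuming your vector store returns `(document, similarity)` pairs where higher means more similar. The `min_results` floor is my own addition so that a strict threshold never leaves the prompt empty, and the 0.7 default is only the example figure from above.

```python
from typing import List, Tuple


def filter_by_threshold(
    results: List[Tuple[str, float]],  # (document, similarity), in any order
    min_score: float = 0.7,
    min_results: int = 1,
) -> List[str]:
    """Keep documents scoring at least min_score, but never fewer than min_results."""
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    kept = [doc for doc, score in ranked if score >= min_score]
    if len(kept) < min_results:
        # Fall back to the best-scoring documents rather than returning nothing.
        kept = [doc for doc, _ in ranked[:min_results]]
    return kept
```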

Query transformation often helps more than tuning retrieval parameters. If the user’s query is “how do I fix the login bug,” the embedding might not match well against documentation that discusses “authentication error handling.” Expanding the query or generating hypothetical document content that would answer the query (a technique called HyDE) can improve results.

def expand_query(query: str, llm) -> str:
    """
    Expand a query with related terms and phrasings.
    """
    prompt = f"""
    Given this search query, generate an expanded version that includes:
    - Synonyms for key terms
    - Related concepts
    - Alternative phrasings
    
    Keep it concise (under 100 words).
    
    Query: {query}
    
    Expanded query:
    """
    return llm.generate(prompt)

def hyde_transform(query: str, llm) -> str:
    """
    Generate a hypothetical document that would answer the query.
    Use this document's embedding for retrieval.
    """
    prompt = f"""
    Write a short paragraph that would be a perfect answer to this question.
    Write as if you're the documentation, not as if you're responding to a user.
    
    Question: {query}
    
    Hypothetical documentation:
    """
    return llm.generate(prompt)

HyDE is particularly effective when queries are questions but your knowledge base is declarative documentation.
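
To show where the hypothetical document plugs into retrieval, here is a self-contained sketch. `generate_fn`, `embed_fn`, and `doc_vectors` are placeholders for whatever LLM client, embedding function, and stored vectors you use, and the brute-force cosine loop stands in for a real vector database.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    generate_fn: Callable[[str], str],                       # prompt -> generated text
    embed_fn: Callable[[List[str]], List[Sequence[float]]],  # texts -> vectors
    doc_vectors: List[Sequence[float]],
    k: int = 5,
) -> List[int]:
    """Return indices of the k documents closest to a generated hypothetical answer."""
    prompt = (
        "Write a short paragraph of documentation that answers this question.\n\n"
        f"Question: {query}\n\nHypothetical documentation:"
    )
    hypothetical_doc = generate_fn(prompt)

    # Embed the generated passage instead of the raw query.
    vector = embed_fn([hypothetical_doc])[0]

    scores = [cosine(vector, doc_vector) for doc_vector in doc_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

The generated passage may contain invented details. That is acceptable here because it is only used to find neighbors in embedding space and is never shown to the user.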

Reranking: The Most Underused Technique

Vector similarity gets you in the right neighborhood. Reranking gets you to the right house.

A reranker takes your initial retrieval results and re-scores them using a more sophisticated model that looks at the query and each document together. Cross-encoders that process query-document pairs jointly are much more accurate than bi-encoders that embed them separately, but they’re too slow to run on your entire corpus. So you use cheap embedding search to get candidates, then expensive reranking to pick the best ones.

Cohere Rerank is the easiest to use:

from typing import List

import cohere

co = cohere.Client("your-api-key")

def rerank_results(query: str, documents: List[str], top_n: int = 5) -> List[str]:
    """
    Rerank documents using Cohere's reranker.
    """
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n
    )
    
    return [documents[r.index] for r in response.results]

For open-source options, cross-encoder models from Sentence Transformers work well:

from typing import List

import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_local(query: str, documents: List[str], top_n: int = 5) -> List[str]:
    """
    Rerank using a local cross-encoder model.
    """
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    
    ranked_indices = np.argsort(scores)[::-1][:top_n]
    return [documents[i] for i in ranked_indices]

In my experience, adding reranking to a mediocre RAG system improves answer quality more than any other single change.

Hybrid Search: Best of Both Worlds

Pure vector search misses exact matches. Pure keyword search misses semantic connections. Hybrid search combines them.

The simplest approach is to run both searches and merge results:

from typing import List

def hybrid_search(
    query: str,
    vector_search_fn,
    keyword_search_fn,
    k: int = 10,
    vector_weight: float = 0.7
) -> List[str]:
    """
    Combine vector and keyword search results.
    """
    # Get results from both
    vector_results = vector_search_fn(query, k=k*2)  # Over-fetch
    keyword_results = keyword_search_fn(query, k=k*2)
    
    # Rank fusion: weighted sum of reciprocal ranks from each result list
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc] = scores.get(doc, 0) + vector_weight * (1 / (rank + 1))
    
    for rank, doc in enumerate(keyword_results):
        scores[doc] = scores.get(doc, 0) + (1 - vector_weight) * (1 / (rank + 1))
    
    # Sort by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]

Many vector databases now support hybrid search natively. Pinecone, Weaviate, and Qdrant all have keyword filtering or hybrid modes. Use them—they’re faster than running separate searches.

The weight between vector and keyword matters. For technical documentation where exact terms matter, weight keywords higher. For conversational queries where intent matters more than exact words, weight vectors higher. You can also learn the weight from user feedback.

Putting It Together

Here’s a complete RAG pipeline with all the pieces:

class RAGPipeline:
    def __init__(
        self,
        embedder,
        vector_store,
        reranker,
        llm,
        retrieval_k: int = 20,
        rerank_k: int = 5
    ):
        self.embedder = embedder
        self.vector_store = vector_store
        self.reranker = reranker
        self.llm = llm
        self.retrieval_k = retrieval_k
        self.rerank_k = rerank_k
    
    def query(self, user_query: str) -> str:
        # Step 1: Retrieve candidates
        query_embedding = self.embedder.embed(user_query)
        candidates = self.vector_store.search(
            query_embedding, 
            k=self.retrieval_k
        )
        
        # Step 2: Rerank
        reranked = self.reranker.rerank(
            user_query, 
            candidates, 
            top_n=self.rerank_k
        )
        
        # Step 3: Generate answer
        context = "\n\n---\n\n".join(reranked)
        # Build the prompt piece by piece so no source-code indentation leaks into it
        prompt = (
            "Answer the question based on the provided context.\n"
            "If the context doesn't contain the answer, say so.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {user_query}\n\n"
            "Answer:"
        )
        
        return self.llm.generate(prompt)

The pipeline is simple. The quality comes from each component being well-tuned for your specific use case.

Debugging When Things Go Wrong

When your RAG system gives bad answers, diagnose systematically.

First, check retrieval. Are the retrieved documents actually relevant to the query? If not, your problem is retrieval—chunking, embeddings, or search parameters. Look at the similarity scores. Look at what’s being retrieved versus what should be retrieved.

Second, check the prompt. If retrieval is good but answers are bad, the LLM might not be using the context effectively. Is the context too long? Is it formatted confusingly? Try simplifying the prompt or being more explicit about using only the provided context.

Third, check context length. Even when everything fits in the context window, stuffing too much into the prompt can bury important information. LLMs have a “lost in the middle” problem: they pay more attention to the beginning and end of a long context than to what sits in between.
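
One mitigation you could try, offered as a suggestion rather than a guaranteed fix, is to reorder the documents so the strongest ones sit at the edges of the context. The sketch below assumes the input list is already sorted best-first (for example, reranker output); the function name is hypothetical.

```python
from typing import List


def reorder_for_long_context(docs_best_first: List[str]) -> List[str]:
    """Place the highest-ranked documents at the edges of the context.

    Input is ordered best-first. Documents alternate between the front and
    the back of the output, so the weakest ones end up in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for position, doc in enumerate(docs_best_first):
        if position % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document comes last.
    return front + back[::-1]
```

Whether this helps depends on the model and the prompt length, so compare answers with and without it on your test set.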

Build logging that captures the query, retrieved documents, prompt, and response for every request. When users report bad answers, you can trace exactly what happened.
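
A minimal sketch of that kind of trace log, writing one JSON object per line to any file-like object. The field names and the `log_rag_request` helper are assumptions; in production you would more likely route this through your existing logging or tracing stack.

```python
import json
import time
import uuid
from typing import List, TextIO


def log_rag_request(
    sink: TextIO,
    query: str,
    retrieved_docs: List[str],
    prompt: str,
    response: str,
) -> str:
    """Append one JSON line describing a request and return its trace ID."""
    trace_id = str(uuid.uuid4())
    record = {
        "trace_id": trace_id,
        "timestamp": time.time(),
        "query": query,
        "retrieved_docs": retrieved_docs,
        "prompt": prompt,
        "response": response,
    }
    # One record per line keeps the log easy to grep and to load later.
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")
    return trace_id
```

Returning the trace ID lets you attach it to the response, so a user’s bug report can point at the exact log record.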

The Bottom Line

Good RAG requires getting the basics right: chunk thoughtfully, choose embeddings that understand your domain, tune retrieval parameters, add reranking, and combine vector search with keyword search for robustness.

None of these techniques are exotic. They’re just rarely applied together with care. Most RAG systems fail because they use default settings for each component and hope for the best.

Start with evaluation. Build a test set of queries with known good answers. Measure retrieval quality and end-to-end answer quality. Then improve systematically—you can’t fix what you can’t measure. For automated evaluation at scale, consider using LLM judges to assess answer quality.