Better RAG results with Reciprocal Rank Fusion and Hybrid Search
The problem with vector-only search
At Assembled, our issue resolution engine is designed to assist customer support by suggesting potential answers to support queries. We use Retrieval Augmented Generation (RAG) for much of this pipeline because it’s quicker to iterate on than fine-tuning, doesn’t require training on customer data (which many companies prefer), and generally provides high-quality results.
However, we encountered a significant challenge with RAG: relying solely on vector search (even using both dense and sparse vectors) doesn’t always deliver satisfactory results for certain queries. This issue was particularly evident when users entered specific keywords that didn’t accurately match stored knowledge articles.
Customer support teams often have multiple articles on similar topics and rarely maintain a tightly curated knowledge base, so vector search sometimes returned irrelevant results to our RAG engine and reduced response accuracy. Users familiar with traditional keyword search were puzzled when our system couldn’t find the right documents, especially for short queries with prominent but ambiguous keywords.
For example, if a user asked “what features are included in a premium plan?”, vector search would readily pull in semantically related documents such as overviews of other plans, customer testimonials, and marketing materials. However, it would often fail to surface the articles specifically about premium plans.
The solution: Hybrid Search with Reciprocal Rank Fusion
To address this issue, we integrated a new keyword search infrastructure and combined its results with vector search. In the example above, keyword search would home in on “features”, “premium”, and “plan”, narrowing the results to documents that specifically match these keywords. A hybrid approach with both vector and keyword search lets us return semantic matches while also giving users the familiar feel of traditional keyword search. Our intuition came from our experience with other machine learning systems, where ensemble models generally outperform single models.
Document store abstraction
To enable a hybrid store solution, we developed a document store abstraction in our code, allowing us to integrate multiple search algorithms. The abstraction is simple but captures all the essential functionalities of a document store and search system. It handles document management and querying and is agnostic to the actual implementation (vector search, keyword search, etc.). Here’s what it looks like:
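A minimal sketch in Python (the names `Document`, `SearchResult`, `upsert`, `delete`, and `search` are illustrative assumptions, not Assembled’s actual API):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Document:
    id: str
    content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class SearchResult:
    document: Document
    score: float

class DocumentStore(ABC):
    """Backend-agnostic interface for document management and search.

    Implementations might wrap a vector database, a keyword search
    engine, or (as described below) a hybrid of several stores.
    """

    @abstractmethod
    def upsert(self, documents: list[Document]) -> None:
        """Insert or update documents in the store."""

    @abstractmethod
    def delete(self, document_ids: list[str]) -> None:
        """Remove documents by id."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        """Return the top matching documents for a query."""
```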
With this abstraction, we had the primitives we needed to swap different search systems in and out easily. Uploading a document could be done once and then sync across multiple document stores. Similarly, searching for a document could be done in parallel across multiple document stores using a standardized query.
The interesting part is that our hybrid search store itself implements the DocumentStore interface. This means that, from the perspective of the caller, it doesn't matter whether they are interacting with a single store or our complex hybrid store: they use the same interface and methods. This design ensures that all of the logic for determining which documents are retrieved is hidden from the caller and can be tested separately. To implement the hybrid store, we passed in multiple child document stores and parallelized the search across all of the child stores.
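Concretely, the fan-out can be sketched like this (illustrative Python, not Assembled’s implementation; child stores are any objects exposing the upsert/search methods of the DocumentStore interface described above):

```python
from concurrent.futures import ThreadPoolExecutor

class HybridDocumentStore:
    """Presents the same interface as its child stores, fanning
    queries out to all of them in parallel."""

    def __init__(self, child_stores):
        self.child_stores = child_stores

    def upsert(self, documents):
        # Writes go to every child so the stores stay in sync.
        for store in self.child_stores:
            store.upsert(documents)

    def search(self, query, limit=10):
        # Query all children concurrently; the per-store ranked
        # lists are fused into a single ranking downstream.
        with ThreadPoolExecutor(max_workers=len(self.child_stores)) as pool:
            ranked_lists = list(
                pool.map(lambda s: s.search(query, limit), self.child_stores)
            )
        return ranked_lists  # one ranked list per child store
```

How the per-store lists are merged into one ranking is covered in the section on combining results.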
Syncing documents across stores
Enabling multiple document stores introduced technical challenges, especially around ensuring synchronization. Out-of-sync document stores could lead to subtle bugs, such as a document being present in one store but not another. To address this, we implemented the following:
- Single source of truth: We maintain a document store in PostgreSQL (for metadata) and S3 (for the actual documents themselves) as a source of truth. This store implements document storage interfaces but is not included in queries. It serves solely for record keeping, allowing us to resync content if necessary.
- Asynchronous updates: Due to higher latency in storing articles, we first update our source of truth in the database and provide an acknowledgement to the frontend. We then asynchronously store the documents in our child stores. This approach helps manage latency across multiple stores and ensures our document stores are eventually consistent.
- Error handling: We also need to handle errors across different platforms. For example, one store might experience a network outage while another completes the storage process successfully. Our PostgreSQL database tracks the synchronization status of each store. If a store fails to sync, we employ exponential backoff to retry the operation, ensuring that all stores are eventually brought into sync.
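The retry logic can be sketched as follows (hypothetical Python; in practice this would run as a background job driven by the per-store sync status recorded in PostgreSQL):

```python
import time

def sync_with_backoff(store, documents, max_attempts=5, base_delay=1.0):
    """Retry syncing documents to a child store with exponential backoff.

    Returns True once the sync succeeds, or False after exhausting
    all attempts (at which point the store is left marked unsynced
    and can be retried later).
    """
    for attempt in range(max_attempts):
        try:
            store.upsert(documents)
            return True
        except Exception:
            # e.g. a transient network outage on one provider;
            # wait 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
    return False
```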
Combining results across multiple search engines
To optimize search performance, we explored several algorithms for merging the results from our different document stores.
Weighting-based fusion
Our initial approach involved experimenting with various weighting mechanisms for sparse/dense vectors and keyword search. The goal was to find optimal weightings that leverage the strengths of each search method. However, identifying the correct weightings proved challenging due to the unknown distribution of vector search scores. This variability made it difficult to determine the relative importance of different weightings.
What’s more, empirical data showed that similarity scores (dot product and Euclidean distance) varied widely across our customer base. The differential performance across these metrics made it impractical to develop a universal weighting scheme for combining vector and keyword searches. Tuning these weights on a per-customer basis was not scalable.
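For concreteness, the kind of weighting-based fusion we’re describing can be sketched as a weighted sum of min-max-normalized scores (illustrative Python; the weights here are exactly the parameters that proved hard to tune per customer):

```python
def minmax_normalize(scores):
    """Map one engine's raw scores into [0, 1] so they are comparable
    across engines with different score distributions."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(score_maps, weights):
    """Combine per-engine {doc_id: score} maps with fixed weights,
    returning doc ids ordered by combined score (best first)."""
    combined = {}
    for scores, w in zip(score_maps, weights):
        for doc, s in minmax_normalize(scores).items():
            combined[doc] = combined.get(doc, 0.0) + w * s
    return sorted(combined, key=combined.get, reverse=True)
```

Even with normalization, the right weights depend on each engine’s score distribution, which is why this approach didn’t generalize across customers.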
Rank fusion
Next, we turned to rank fusion algorithms, inspired by literature reviews and their demonstrated effectiveness in search optimization (see [0] and [1]). Rank fusion algorithms, particularly Reciprocal Rank Fusion (RRF), provided a promising alternative. Here’s how most rank fusion algorithms work:
- Rank assignment: Each document from the individual ranked lists is assigned a score based on its rank position. Typically, the score is the reciprocal of its rank (i.e., 1/rank). For example, a document ranked first gets a score of 1, the second gets 0.5, the third gets 0.33, and so on.
- Score summation: The scores from all ranked lists are summed for each document. Documents appearing in multiple lists accumulate higher combined scores.
- Final ranking: Documents are re-ranked based on their combined scores, producing a final ranked list that integrates the rankings from all individual search engines.
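The steps above can be sketched as a short function (illustrative Python, not Assembled’s implementation; the optional smoothing constant `k` comes from the original RRF formulation, where the plain 1/rank scoring described above corresponds to k = 0):

```python
def reciprocal_rank_fusion(ranked_lists, k=0):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in
    (rank is 1-based), and scores are summed across lists. With k=0
    this is the 1/rank scheme described above; the original RRF
    paper uses a smoothing constant of k=60.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Re-rank by combined score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

Note that a document appearing in both lists (like a keyword hit that is also semantically similar) accumulates score from each, which is what pushes it toward the top of the fused ranking.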
Why we chose RRF
After extensive testing, Reciprocal Rank Fusion (RRF) consistently outperformed many of the more complex methods we evaluated. Several factors contributed to this:
- Simplicity and robustness: RRF’s simplicity makes it less prone to overfitting specific scenarios, aligning with the principle of Occam’s razor. This simplicity enhances its robustness across different contexts.
- Minimal tuning: RRF provides a straightforward and effective way to rank results without the need for extensive parameter tuning. This is particularly advantageous given our diverse customer base and their varying knowledge bases.
By implementing RRF, we achieved a flexible and scalable method for combining search results. Using RRF, we not only enhanced the accuracy and relevance of search outcomes but also simplified the overall search infrastructure, ensuring a robust solution for our diverse customer set.
Optimizing our search infrastructure
Finally, a note on our search engine choices. At Assembled, we use Pinecone for vector search and Algolia for keyword search. After minimal testing with other providers, we concluded that the marginal benefits of further optimization were not significant. Consequently, we decided against hosting our own open-source vector database, such as Milvus, or managing our own keyword search on Elasticsearch.
Using B2B solutions like Pinecone and Algolia offers several advantages:
- Cost efficiency: These services are reasonably cost-effective (especially when comparing to engineering time) and eliminate the need for significant upfront investment in infrastructure.
- Maintenance reduction: Like most companies, we’re constrained on engineering resources, so outsourcing to specialized search companies spares us the substantial maintenance burden of self-hosted solutions. This lets our team focus on our core functionality: AI replies for support issues.
- Performance: Algolia in particular provides low-latency responses, a robust API, and highly optimized search outputs which would likely outperform anything we could build on Elasticsearch.
We’re hiring!
We’re seeing exciting results, but there’s always more to do. Since implementing this framework, there have been many developments in RAG-based techniques, such as fine-tuning of embedding models, applying matrix transformations to vector results, HyDE, and more. If you’re interested in helping us solve these problems, check out our open roles.
This was also posted on the Assembled engineering blog, check it out for more articles related to what we’re building.