AWS Big Data Blog

Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using Amazon OpenSearch Service

In the context of Retrieval-Augmented Generation (RAG), knowledge retrieval plays a crucial role because the quality of the retrieved context sets the upper bound on what large language model (LLM) generation can achieve.

Currently, the most common approach in RAG retrieval is semantic search based on dense vectors. However, dense embeddings do not perform well in understanding specialized terms or jargon in vertical domains. A more advanced method is to combine dense retrieval with traditional inverted-index (BM25) retrieval, but this approach requires spending considerable time customizing lexicons, synonym dictionaries, and stop-word dictionaries for optimization.

In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using Amazon OpenSearch Service and run experiments on public datasets to show its advantages. The full code is available in the GitHub repo aws-samples/opensearch-dense-spase-retrieval.

What is sparse vector retrieval?

Sparse vector retrieval is a recall method based on an inverted index, with an added step of term expansion. It comes in two modes: document-only and bi-encoder. For more details about these two terms, see Improving document retrieval with sparse semantic encoders.

Simply put, in document-only mode, term expansion is performed only during document ingestion. In bi-encoder mode, term expansion is conducted both during ingestion and at query time. Bi-encoder mode improves retrieval quality but introduces additional query latency. The following figure demonstrates its effectiveness.

Neural sparse search in OpenSearch achieves 12.7% (document-only) to 20% (bi-encoder) higher NDCG@10, comparable to the TAS-B dense vector model.

With neural sparse search, you don’t need to configure dictionaries yourself; terms are expanded for you automatically. Additionally, in an OpenSearch index over a small, specialized dataset, the number of matching terms is generally small, and the computed term frequencies can produce unreliable term weights, which may significantly bias or distort BM25 scoring. Because sparse vector retrieval expands terms first, the number of matching terms is much larger than before, which helps produce more reliable scores.
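
For intuition, a sparse embedding is simply a map from (expanded) tokens to learned weights, stored in a rank_features field. The following sketch is purely illustrative; the tokens and weights are invented, not actual model output:

# Illustrative only: a neural sparse encoder expands the input text into
# related tokens and assigns each a weight. The document is stored as a
# token-to-weight map like this in a rank_features field.
sparse_embedding = {
    "ibuprofen": 2.1,    # original term
    "painkiller": 1.4,   # expanded term
    "nsaid": 1.2,        # expanded term
    "dose": 0.8,
    "tablet": 0.5,
}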

Although the absolute metrics of the sparse vector model can’t surpass those of the best dense vector models, it possesses unique and advantageous characteristics. For instance, in terms of the NDCG@10 metric, as mentioned in Improving document retrieval with sparse semantic encoders, evaluations on some datasets reveal that its performance can be better than that of state-of-the-art dense vector models, such as on the DBPedia dataset. This indicates a certain level of complementarity between them. Intuitively, for some extremely short user inputs, the vectors generated by dense vector models might carry significant semantic uncertainty, so overlaying a sparse vector model can be beneficial. Additionally, sparse vector retrieval maintains interpretability: you can still observe the scoring calculation through the explain API. To take advantage of both methods, OpenSearch has introduced a built-in feature called hybrid search.

How to combine dense and sparse?

1. Deploy a dense vector model

To obtain more meaningful test results, we selected Cohere embed-multilingual-v3.0, one of several popular dense vector models used in production. We can access it through Amazon Bedrock and use the following two functions to create a connector for bedrock-cohere and then register it as a model in OpenSearch. You can get its model ID from the response.

import json

import boto3
import requests
from requests_aws4auth import AWS4Auth


def create_bedrock_cohere_connector(account_id, aos_endpoint, input_type='search_document'):
    # input_type can be 'search_document' (for ingestion) or 'search_query' (for queries)
    service = 'es'
    session = boto3.Session()
    credentials = session.get_credentials()
    region = session.region_name
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

    path = '/_plugins/_ml/connectors/_create'
    url = 'https://' + aos_endpoint + path

    role_name = "OpenSearchAndBedrockRole"
    role_arn = "arn:aws:iam::{}:role/{}".format(account_id, role_name)
    model_name = "cohere.embed-multilingual-v3"

    bedrock_url = "https://bedrock-runtime.{}.amazonaws.com/model/{}/invoke".format(region, model_name)

    payload = {
      "name": "Amazon Bedrock Connector: Cohere doc embedding",
      "description": "The connector to the Bedrock Cohere multilingual doc embedding model",
      "version": 1,
      "protocol": "aws_sigv4",
      "parameters": {
        "region": region,
        "service_name": "bedrock"
      },
      "credential": {
        "roleArn": role_arn
      },
      "actions": [
        {
          "action_type": "predict",
          "method": "POST",
          "url": bedrock_url,
          "headers": {
            "content-type": "application/json",
            "x-amz-content-sha256": "required"
          },
          "request_body": "{ \"texts\": ${parameters.texts}, \"input_type\": \"search_document\" }",
          "pre_process_function": "connector.pre_process.cohere.embedding",
          "post_process_function": "connector.post_process.cohere.embedding"
        }
      ]
    }
    headers = {"Content-Type": "application/json"}

    r = requests.post(url, auth=awsauth, json=payload, headers=headers)
    return json.loads(r.text)["connector_id"]
    
def register_and_deploy_aos_model(aos_client, model_name, model_group_id, description, connector_id):
    request_body = {
        "name": model_name,
        "function_name": "remote",
        "model_group_id": model_group_id,
        "description": description,
        "connector_id": connector_id
    }

    response = aos_client.transport.perform_request(
        method="POST",
        url="/_plugins/_ml/models/_register?deploy=true",
        body=json.dumps(request_body)
    )

    return response
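
The following is a minimal usage sketch of the two functions above. The account ID, endpoint, and model group ID are placeholders, and it assumes you have already created a model group and an authenticated aos_client (OpenSearch Python client):

# Placeholder values; assumes a model group and an aos_client already exist
connector_id = create_bedrock_cohere_connector(
    account_id="123456789012",
    aos_endpoint="my-domain.us-east-1.es.amazonaws.com",
)

response = register_and_deploy_aos_model(
    aos_client,
    model_name="bedrock-cohere-embed-multilingual-v3",
    model_group_id="<model_group_id>",
    description="Cohere multilingual embedding model on Amazon Bedrock",
    connector_id=connector_id,
)
cohere_model_id = response["model_id"]  # the model ID is returned in the response, as noted above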

2. Deploy a sparse vector model

Currently, you can’t deploy the sparse vector model in an OpenSearch Service domain. You must deploy it in Amazon SageMaker first, then integrate it through an OpenSearch Service model connector. For more information, see Amazon OpenSearch Service ML connectors for AWS services.

Complete the following steps:

2.1 On the OpenSearch Service console, choose Integrations in the navigation pane.

2.2 Under Integration with Sparse Encoders through Amazon SageMaker, choose to configure a VPC domain or public domain.

Next, you configure the AWS CloudFormation template.

2.3 Enter the parameters as shown in the following screenshot.

2.4 Get the sparse model ID from the stack output.
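
Before referencing the sparse model in a pipeline, you can confirm it is deployed by querying the ML model API with the model ID from the stack output. The following is a minimal sketch; the model ID is a placeholder:

# Placeholder model ID taken from the CloudFormation stack output
sparse_model_id = "<sparse_model_id_from_stack_output>"

response = aos_client.transport.perform_request(
    method="GET",
    url=f"/_plugins/_ml/models/{sparse_model_id}",
)
print(response.get("model_state"))  # expect DEPLOYED when the model is ready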

3. Set up pipelines for ingestion and search

Use the following code to create pipelines for ingestion and search. With these two pipelines in place, you don’t need to run model inference yourself; you only ingest the text field, and the pipelines invoke the models for you.

PUT /_ingest/pipeline/neural-sparse-pipeline
{
  "description": "neural sparse encoding pipeline",
  "processors" : [
    {
      "sparse_encoding": {
        "model_id": "<nerual_sparse_model_id>",
        "field_map": {
           "content": "sparse_embedding"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<cohere_ingest_model_id>",
        "field_map": {
          "doc": "dense_embedding"
        }
      }
    }
  ]
}

PUT /_search/pipeline/hybrid-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "l2"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ]
}
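
If you prefer to drive everything from Python, as in the rest of this post, the same pipeline definitions can be created through the OpenSearch client. The following is a minimal sketch for the search pipeline; the 0.5/0.5 weights correspond to the sparse and dense sub-queries in that order and can be tuned per dataset:

# Minimal sketch: create the hybrid search pipeline from Python
search_pipeline_body = {
    "description": "Post processor for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "l2"},
                "combination": {
                    "technique": "arithmetic_mean",
                    # weights apply to the sub-queries in order: [sparse, dense]
                    "parameters": {"weights": [0.5, 0.5]},
                },
            }
        }
    ],
}

aos_client.transport.perform_request(
    method="PUT",
    url="/_search/pipeline/hybrid-search-pipeline",
    body=json.dumps(search_pipeline_body),
)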

4. Create an OpenSearch index with dense and sparse vectors

Use the following code to create an OpenSearch index with dense and sparse vectors. You must specify the default_pipeline as the ingestion pipeline created in the previous step.

PUT {index-name}
{
    "settings" : {
        "index":{
            "number_of_shards" : 1,
            "number_of_replicas" : 0,
            "knn": "true",
            "knn.algo_param.ef_search": 32
        },
        "default_pipeline": "neural-sparse-pipeline"
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"},
            "dense_embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 32
                    }
                }            
            },
            "sparse_embedding": {
                "type": "rank_features"
            }
        }
    }
}
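
To verify the wiring, you can index a single test document through the default pipeline and confirm that both embedding fields are generated. The following is a minimal sketch; the index name and document text are placeholders:

# Index one test document; the default ingest pipeline generates both embeddings
aos_client.index(
    index=index_name,
    id="doc-1",
    body={"content": "How do brokers calculate margin interest?"},
    refresh=True,
)

doc = aos_client.get(index=index_name, id="doc-1")
print("dense_embedding" in doc["_source"])   # expect True
print("sparse_embedding" in doc["_source"])  # expect True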

Testing methodology

1. Experimental data selection

For retrieval evaluation, it is common to use datasets from BeIR. However, not all BeIR datasets are suitable for RAG. To mimic a knowledge retrieval scenario, we chose BeIR/fiqa and squad_v2 as our experimental datasets. Their schemas are shown in the following figures.

The following is a data preview of squad_v2.

The following is a query preview of BeIR/fiqa.

The following is a corpus preview of BeIR/fiqa.

You can find fields equivalent to question and context in both datasets, which closely mirrors knowledge retrieval in RAG. In the subsequent experiments, we ingest the context field into the OpenSearch index as the text content and use the question field as the query for the retrieval test.

2. Test data ingestion

The following script ingests data into the OpenSearch Service domain:

import json
import time

from tqdm import tqdm
from beir import LoggingHandler, util
from beir.datasets.data_loader import GenericDataLoader
from setup_model_and_pipeline import get_aos_client

aos_client = get_aos_client(aos_endpoint)  # aos_endpoint: your OpenSearch Service domain endpoint

def ingest_dataset(corpus, aos_client, index_name, bulk_size=50):
    i = 0
    bulk_body = []
    for _id, body in tqdm(corpus.items()):
        text = body["title"] + " " + body["text"]
        bulk_body.append({"index": {"_index": index_name, "_id": _id}})
        bulk_body.append({"content": text})
        i += 1
        if i % bulk_size == 0:
            response = aos_client.bulk(bulk_body, request_timeout=100)
            try:
                assert response["errors"] == False
            except AssertionError:
                print("there are errors")
                print(response)
                time.sleep(1)
                # retry the failed batch once
                response = aos_client.bulk(bulk_body, request_timeout=100)
            bulk_body = []

    if bulk_body:
        response = aos_client.bulk(bulk_body, request_timeout=100)
        assert response["errors"] == False
    aos_client.indices.refresh(index=index_name)

url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"  # dataset_name: for example, "fiqa"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
ingest_dataset(corpus, aos_client=aos_client, index_name=index_name)

3. Performance evaluation of retrieval

In RAG knowledge retrieval, we usually focus on the relevance of the top results, so our evaluation uses recall@4 as the primary metric. The test compares several retrieval methods: bm25_only, sparse_only, dense_only, hybrid_dense_sparse, and hybrid_dense_bm25.

The following script uses hybrid_sparse_dense to demonstrate the evaluation logic:

from beir.retrieval.evaluation import EvaluateRetrieval

def search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk=4):
    request_body = {
      "size": topk,
      "query": {
        "hybrid": {
          "queries": [
            {
              "neural_sparse": {
                  "sparse_embedding": {
                    "query_text": query,
                    "model_id": sparse_model_id,
                    "max_token_score": 3.5
                  }
              }
            },
            {
              "neural": {
                  "dense_embedding": {
                      "query_text": query,
                      "model_id": dense_model_id,
                      "k": 10
                    }
                }
            }
          ]
        }
      }
    }

    response = aos_client.transport.perform_request(
        method="GET",
        url=f"/{index_name}/_search?search_pipeline=hybird-search-pipeline",
        body=json.dumps(request_body)
    )

    return response["hits"]["hits"]
    
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
run_res = {}
topk = 10  # retrieve enough hits to evaluate recall@10
for _id, query in tqdm(queries.items()):
    hits = search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk)
    run_res[_id]={item["_id"]:item["_score"] for item in hits}
    
for query_id, doc_dict in tqdm(run_res.items()):
    if query_id in doc_dict:
        doc_dict.pop(query_id)
res = EvaluateRetrieval.evaluate(qrels, run_res, [1, 4, 10])
print("search_by_dense_sparse:")
print(res)
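
For comparison, the bm25_only baseline is simply a full-text match query on the text field with no search pipeline or models involved. The following is a minimal sketch of such a baseline:

def search_by_bm25(aos_client, index_name, query, topk=4):
    # Plain BM25 full-text match on the content field; no search pipeline needed
    request_body = {
        "size": topk,
        "query": {
            "match": {
                "content": query
            }
        }
    }
    response = aos_client.search(index=index_name, body=request_body)
    return response["hits"]["hits"]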

Results

In the context of RAG, developers usually don’t pay much attention to a ranking metric like NDCG@10, because the LLM picks out the relevant context automatically; what matters more is whether the relevant context is retrieved at all, which is what recall measures. Based on our experience with RAG, we measured recall@1, recall@4, and recall@10 for your reference.
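
For reference, recall@k here means the fraction of a query’s relevant documents that appear among the top k retrieved results, averaged over all queries. BeIR’s EvaluateRetrieval computes this for us, but an illustrative per-query definition looks like this:

def recall_at_k(relevant_ids, retrieved_ids, k):
    # Fraction of relevant documents that appear in the top-k retrieved results
    retrieved_topk = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved_topk)
    return hits / len(relevant_ids) if relevant_ids else 0.0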

The BeIR/fiqa dataset is mainly used to evaluate retrieval, whereas squad_v2 is mainly used to evaluate reading comprehension. In terms of retrieval, squad_v2 is much less challenging than BeIR/fiqa. In real RAG scenarios, retrieval may not be as difficult as with BeIR/fiqa, so we evaluate both datasets.

The hybrid_dense_sparse method is consistently beneficial. The following table shows our results.

                      BeIR/fiqa                        squad_v2
Method                Recall@1  Recall@4  Recall@10    Recall@1  Recall@4  Recall@10
bm25                  0.112     0.215     0.297        0.59      0.771     0.851
dense                 0.156     0.316     0.398        0.671     0.872     0.925
sparse                0.196     0.334     0.438        0.684     0.865     0.926
hybrid_dense_sparse   0.203     0.362     0.456        0.704     0.885     0.942
hybrid_dense_bm25     0.156     0.316     0.394        0.671     0.871     0.925

Conclusion

The new neural sparse search feature in OpenSearch Service version 2.11, when combined with dense vector retrieval, can significantly improve the effectiveness of knowledge retrieval in RAG scenarios. Compared to the combination of BM25 and dense vector retrieval, it’s more straightforward to use and more likely to achieve better results.

OpenSearch Service version 2.12 has recently upgraded its Lucene engine, significantly improving the throughput and latency of neural sparse search. However, neural sparse search currently supports only English; other languages might be supported in the future. As the technology continues to evolve, it stands to become a popular and widely applicable way to enhance retrieval performance.


About the Authors

YuanBo Li is a Specialist Solutions Architect in GenAI/AIML at Amazon Web Services. His interests include RAG (Retrieval-Augmented Generation) and agent technologies within the field of GenAI, and he is dedicated to proposing innovative GenAI technical solutions tailored to diverse business needs.

Charlie Yang is an AWS engineering manager with the OpenSearch Project. He focuses on machine learning, search relevance, and performance optimization.

River Xie is a generative AI specialist solutions architect at Amazon Web Services. River is interested in agent and multi-agent workflows and large language model inference optimization, and is passionate about leveraging cutting-edge generative AI technologies to develop modern applications that solve complex business challenges.

Ren Guo is the manager of the Generative AI Specialist Solutions Architect team for the AIML and Data domains at AWS, Greater China Region.