Choosing the Right SBERT Model for Your Data with BEIR

Satish Silveri · Published in Nerd For Tech · 8 min read · Jan 14, 2024


Source: http://tinyurl.com/m2e7b5xx

Semantic textual similarity (STS) has become a crucial aspect of many natural language processing (NLP) applications, ranging from information retrieval to question-answering systems. Sentence-BERT (SBERT) models are popular for STS tasks because they capture sentence-level semantic information well. Still, when working with your own dataset, you need to select the right SBERT model to get optimal performance. The BEIR framework (BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models) provides a comprehensive set of tools to evaluate and select SBERT models based on your specific data.

To learn more about semantic search and/or STS, refer to the following documentation.

The choice of SBERT models is influenced by:

  1. Type of semantic search being performed (symmetric/asymmetric).
  2. Distance measure (dot product/cosine similarity); see the short sketch below.

Refer to the following documentation for further insights into semantic search types and the pre-trained models (along with their architectures) tailored for these search types.

Also, check the MTEB leaderboard for more insights on how models perform on specific NLP tasks.
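To make the distance-measure choice concrete, here is a minimal sketch that encodes a query and a passage with sentence-transformers and scores them both ways. The model name and texts are arbitrary illustrative choices; in practice, the measure you later pass as score_function should match what the chosen model was trained with (many asymmetric-search models, for instance, are tuned for dot product).

# Illustrative only: score one query/passage pair with both distance measures.
# The model and texts here are arbitrary examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')  # an asymmetric-search model

query_emb = model.encode("How do I enable GPU acceleration?", convert_to_tensor=True)
doc_emb = model.encode("Install the CUDA toolkit and a matching driver to enable GPU acceleration.", convert_to_tensor=True)

# Cosine similarity normalizes the vectors, so scores land in [-1, 1].
print(util.cos_sim(query_emb, doc_emb))

# Dot product keeps vector magnitudes, which suits models trained with a dot-product objective.
print(util.dot_score(query_emb, doc_emb))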

Negative Sampling

A traditional Q&A dataset consists of questions and their corresponding responses/answers. For model evaluation, these pairs are referred to as “positive samples.” Often, we feed these positive samples to the model evaluator to understand how well a particular model can map a question to its corresponding answer without having seen that data before (i.e., zero-shot on your in-domain data). However, for a comprehensive understanding of the model’s capabilities, we also need input samples where the answers are entirely unrelated to the questions. The intuition is to gauge the model’s ability not only to recognize correct pairs but also to reject incorrect ones. These pairs are referred to as “negative samples,” and negative sampling is the process of creating them, here using contextual embeddings. We will look at how to create negative samples from the source data.
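Before the full implementation later in this post, here is a bare-bones sketch of the idea, assuming we already have an embedding for every answer: for each question, draw a random answer from another row and accept it as a negative only if its similarity to the true answer stays below a threshold. The helper name and the retry cap are illustrative choices, not part of the evaluator class.

# Bare-bones negative-sampling sketch; the full version lives in get_qrels below.
import random
from sklearn.metrics.pairwise import cosine_similarity

def pick_negative(row_idx, answer_embeddings, threshold, max_tries=50):
    '''Return the index of a random answer that is dissimilar enough to the true answer.'''
    true_emb = answer_embeddings[row_idx].reshape(1, -1)
    candidates = [i for i in range(len(answer_embeddings)) if i != row_idx]
    for _ in range(max_tries):
        cand = random.choice(candidates)
        sim = cosine_similarity(true_emb, answer_embeddings[cand].reshape(1, -1))[0, 0]
        if sim < threshold:
            return cand
    return None  # no sufficiently dissimilar answer found within max_tries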

Now that we have some context, let’s dive into the process of selecting ideal SBERT models for your Information Retrieval projects.

Dataset

We will be using the NVIDIA Documentation Question and Answer pairs dataset from Kaggle.

Process

  1. Create a configuration file (config.yaml) using YAML.
# config.yaml
data_path: NvidiaDocumentationQandApairs.csv
target_dir: /path/to/NvidiaQAEval
models:
- thenlper/gte-large
- BAAI/bge-large-en-v1.5
- intfloat/e5-large-v2

batch_size: 100
score_function: cos_sim
question_column: question
answer_column: answer
negative_samples: True
negative_sampler_model_id: random
negative_sample_size: 100
threshold_sample_fraction: 0.20
  • data_path: path to your QA dataset.
  • target_dir: path to save results.
  • models: list of model IDs or model paths on the local file system.
  • batch_size: batch size for the evaluator.
  • score_function: distance measure for evaluator (dot/cos_sim).
  • question_column: column name consisting of questions.
  • answer_column: column name consisting of answers.
  • negative_samples: flag for inclusion of negative samples.
  • negative_sampler_model_id: model id or path for negative sampling. If you select “random” as the value, a random model_id from the models’ list will be selected and used for negative sampling.
  • negative_sample_size: number of random rows sampled as candidates when selecting a negative sample for each question.
  • threshold_sample_fraction: fraction of the input data to be considered for finding the threshold value for negative sampling for a given model.
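The evaluator class below reads this file with PyYAML. If you want to sanity-check the configuration before kicking off a long run, a small pre-flight check along these lines can catch missing keys early (this helper is a suggestion, not part of the evaluator class):

# Optional pre-flight check for config.yaml (not part of EvaluateSBERTModels).
import yaml

REQUIRED_KEYS = ["data_path", "target_dir", "models", "batch_size", "score_function",
                 "question_column", "answer_column", "negative_samples",
                 "negative_sampler_model_id", "negative_sample_size",
                 "threshold_sample_fraction"]

with open("config.yaml") as f:
    config = yaml.safe_load(f)

missing = [key for key in REQUIRED_KEYS if key not in config]
assert not missing, f"config.yaml is missing keys: {missing}"
assert config["score_function"] in ("dot", "cos_sim"), "score_function must be 'dot' or 'cos_sim'"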

2. Environment setup

#requirements.txt
beir==0.2.2
pandas==2.0.3
sentence_transformers==2.2.2
scikit-learn==1.3.0
pyyaml==6.0.2

conda create -n {your_env_name} python=3.11

conda activate {your_env_name}

pip install -r requirements.txt

3. Evaluator Class

import pandas as pd
import uuid
import yaml
import os
import json
import datetime
import random

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval import models
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# for notebooks, use the notebook-friendly progress bar instead:
# from tqdm.notebook import tqdm

tqdm.pandas()

class EvaluateSBERTModels:

    def __init__(self, config_file_path):

        with open(config_file_path) as f:
            config = yaml.load(f, Loader=yaml.FullLoader)

        self.config = config

        self.target_dir = self.config['target_dir']

        if self.target_dir is None or len(self.target_dir) == 0:
            self.target_dir = os.getcwd()

        # use a single timestamp so inputs and results land under the same run directory
        run_id = 'sbert_eval_{}'.format(datetime.datetime.now().strftime("%Y%m%d%H%M%S"))

        # evaluator inputs path
        self.eval_input_base_path = os.path.join(self.target_dir, run_id, 'evaluator_input')
        if not os.path.exists(self.eval_input_base_path):
            os.makedirs(self.eval_input_base_path)

        # results base path
        self.results_base_path = os.path.join(self.target_dir, run_id, 'results')
        if not os.path.exists(self.results_base_path):
            os.makedirs(self.results_base_path)

    def assign_ids(self, df):
        '''
        Function to assign unique ids to questions and answers.
        '''
        df['q_id'] = [str(uuid.uuid4()) for _ in range(len(df))]
        df['ans_id'] = [str(uuid.uuid4()) for _ in range(len(df))]

        return df

    def compute_threshold(self, df, threshold_sample_fraction=0.30):
        '''
        Dynamically compute a threshold value based on embeddings.

        Formula:

            threshold = min(cosine_sim) + ((max(cosine_sim) - min(cosine_sim)) / 3)

        Inputs:
            df: input dataframe with question and answer embeddings.
            threshold_sample_fraction: fraction of datapoints to consider for computing the threshold.
        Outputs:
            threshold: computed threshold.
        '''

        frac_df = df.sample(frac=threshold_sample_fraction)

        frac_df['cosine_sim'] = frac_df.apply(
            lambda row: self.compute_cosine_similarity(row['question_embeddings'], row['answer_embeddings']),
            axis=1)

        similarity_scores = frac_df['cosine_sim'].tolist()

        min_score = min(similarity_scores)
        max_score = max(similarity_scores)

        threshold = min_score + ((max_score - min_score) / 3)

        return threshold

    def compute_cosine_similarity(self, embeddings1, embeddings2):
        '''
        Function to compute the cosine similarity between two embeddings.
        '''
        return cosine_similarity(embeddings1.reshape(1, -1), embeddings2.reshape(1, -1))[0, 0]

    def get_qrels(self, df, model_id, question_column, answer_column, negative_samples=False, negative_sample_size=10):
        '''
        Function to generate qrels for the evaluator input.

        Input:
            df: input dataframe.
            model_id: model id or path for sentence-transformers.
            question_column: name of the column containing questions.
            answer_column: name of the column containing answers.
            negative_samples: flag to include negative samples.
            negative_sample_size: number of candidate rows sampled per question when picking a negative.

        Output:
            qrels: qrels for input to the evaluator.
        '''
        qrels = []

        qrels.append('query-id\tcorpus-id\tscore')

        model = SentenceTransformer(model_id)

        threshold = 0.0

        if negative_samples:
            # Compute embeddings for questions and answers
            df['question_embeddings'] = df[question_column].progress_apply(lambda x: model.encode(x))
            df['answer_embeddings'] = df[answer_column].progress_apply(lambda x: model.encode(x))

            # dynamically compute the threshold
            threshold = self.compute_threshold(df=df, threshold_sample_fraction=self.config['threshold_sample_fraction'])

        for _, row in tqdm(df.iterrows(), total=df.shape[0]):
            # Positive sample (answer to the question)
            qrels.append('{}\t{}\t1'.format(row['q_id'], row['ans_id']))

            if negative_samples:
                # Negative sample: a random answer from another question whose similarity
                # to the current answer falls below the threshold
                candidate_negatives = df[df['q_id'] != row['q_id']].sample(negative_sample_size)
                random_negative = candidate_negatives.sample(1).iloc[0]
                neg_similarity = self.compute_cosine_similarity(row['answer_embeddings'], random_negative['answer_embeddings'])

                # keep resampling until a candidate scores below the threshold
                while neg_similarity >= threshold:
                    random_negative = candidate_negatives.sample(1).iloc[0]
                    neg_similarity = self.compute_cosine_similarity(row['answer_embeddings'], random_negative['answer_embeddings'])

                qrels.append('{}\t{}\t0'.format(row['q_id'], random_negative['ans_id']))

        return qrels

    def create_data_for_evaluator(self):
        '''
        Function to convert the input data to a BEIR data-loader compatible format.
        '''

        assert len(self.config['data_path']) > 0, "Data path cannot be empty."

        # load data
        data_df = pd.read_csv(self.config['data_path'])

        # Assign unique ids to questions and answers
        data_df = self.assign_ids(data_df)

        corpus = []
        queries = []

        for index, item in data_df.iterrows():
            data = {}
            query = {}
            data['_id'] = item['ans_id']
            data['text'] = item[self.config['answer_column']]
            data['title'] = ""
            corpus.append(data)
            query['_id'] = item['q_id']
            query['text'] = item[self.config['question_column']]
            queries.append(query)

        negative_sampler_model_id = None

        if self.config['negative_sampler_model_id'] == "random":
            # select a random model_id for generating embeddings
            negative_sampler_model_id = random.choice(self.config['models'])
        else:
            negative_sampler_model_id = self.config['negative_sampler_model_id']

        qrels = self.get_qrels(df=data_df, model_id=negative_sampler_model_id,
                               question_column=self.config['question_column'],
                               answer_column=self.config['answer_column'],
                               negative_samples=self.config['negative_samples'],
                               negative_sample_size=self.config['negative_sample_size'])

        # write corpus
        with open(os.path.join(self.eval_input_base_path, 'corpus.jsonl'), 'w') as f:
            for index, _dict in enumerate(corpus):
                if index < len(corpus) - 1:
                    f.write(json.dumps(_dict) + '\n')
                else:
                    f.write(json.dumps(_dict))

        # write queries
        with open(os.path.join(self.eval_input_base_path, 'queries.jsonl'), 'w') as f:
            for index, _dict in enumerate(queries):
                if index < len(queries) - 1:
                    f.write(json.dumps(_dict) + '\n')
                else:
                    f.write(json.dumps(_dict))

        # write qrels
        qrels_str = '\n'.join(qrels)
        with open(os.path.join(self.eval_input_base_path, 'qrels.tsv'), 'w') as f:
            f.write(qrels_str)


    def load_data_for_evaluator(self):
        '''
        Function to load the data for the evaluator.
        '''
        corpus, queries, qrels = GenericDataLoader(
            corpus_file=os.path.join(self.eval_input_base_path, 'corpus.jsonl'),
            query_file=os.path.join(self.eval_input_base_path, 'queries.jsonl'),
            qrels_file=os.path.join(self.eval_input_base_path, 'qrels.tsv')).load_custom()

        return corpus, queries, qrels


    def evaluate_model(self, model, corpus, queries, qrels):
        '''
        Function to evaluate an SBERT model.

        Input:
            model: model path or model id.
            corpus, queries, qrels: evaluator inputs loaded via the BEIR data loader.

        Output:
            ndcg: Normalized Discounted Cumulative Gain scores for the given model.
            _map: Mean Average Precision scores for the given model.
            recall: Recall scores for the given model.
            precision: Precision scores for the given model.
        '''

        batch_size = self.config['batch_size']
        if batch_size is None:
            batch_size = 64

        score_function = self.config['score_function']
        if score_function is None:
            score_function = "dot"

        model = DRES(models.SentenceBERT(model), batch_size=batch_size)
        retriever = EvaluateRetrieval(model, score_function=score_function)
        results = retriever.retrieve(corpus, queries)

        ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

        return ndcg, _map, recall, precision


    def run_evaluator(self):
        '''
        Pipeline function to run the evaluator.
        '''

        assert len(self.config['models']) > 0, "Evaluator requires 1 or more models."

        ndcg_results = []
        map_results = []
        recall_results = []
        precision_results = []

        # create data for the evaluator
        self.create_data_for_evaluator()

        # load data for the evaluator
        corpus, queries, qrels = self.load_data_for_evaluator()

        # evaluate each model
        for model in self.config['models']:

            ndcg, _map, recall, precision = self.evaluate_model(model, corpus, queries, qrels)
            ndcg['Model'] = model
            ndcg_results.append(ndcg)
            _map['Model'] = model
            map_results.append(_map)
            recall['Model'] = model
            recall_results.append(recall)
            precision['Model'] = model
            precision_results.append(precision)

        with open(os.path.join(self.results_base_path, 'NDCG.json'), 'w') as f:
            f.write(json.dumps(ndcg_results))

        with open(os.path.join(self.results_base_path, 'MAP.json'), 'w') as f:
            f.write(json.dumps(map_results))

        with open(os.path.join(self.results_base_path, 'RECALL.json'), 'w') as f:
            f.write(json.dumps(recall_results))

        with open(os.path.join(self.results_base_path, 'PRECISION.json'), 'w') as f:
            f.write(json.dumps(precision_results))


sbert_evaluator = EvaluateSBERTModels(config_file_path='config.yaml')
sbert_evaluator.run_evaluator()

Features of the EvaluateSBERTModels class:

  • Negative samples generator (get_qrels): BEIR’s evaluator requires three input files: queries, corpus, and qrels. Queries hold the questions in the samples, while the corpus contains the corresponding answers/responses (an example of each file’s format is sketched after this list). The qrels file distinguishes positive samples (and negative ones, if the flag is set to True) with scores of 1 and 0, respectively. To generate this file, we assign a unique ID to each question and answer. Positive samples are created by marking each question-answer pair from the data with 1. For negative samples, we randomly sample candidate answers from other questions and keep one whose similarity to the true answer falls below a computed threshold.
  • Automated threshold computation (compute_threshold): One challenge in working with encoders/SBERT models is that the cosine similarity scores they produce fall within a model-specific range between some minimum and maximum value. Since each encoder has a different score distribution, a single static threshold would not transfer across models. Instead, we dynamically compute the threshold by selecting a fraction of the input data, computing similarity scores between each question and its answer, and applying the following formula:
threshold = min(scores) + ((max(scores) - min(scores))/3)
For example, if the sampled scores range from 0.20 to 0.80, the threshold becomes 0.20 + (0.60/3) = 0.40; candidate negatives scoring at or above this value are rejected.
  • Function (create_data_for_evaluator) to generate the input data for the evaluator as per BEIR guidelines.
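For reference, create_data_for_evaluator writes the three files in the layout expected by BEIR’s GenericDataLoader(...).load_custom(). The IDs and texts below are made-up examples of what a single row in each file looks like.

corpus.jsonl (one JSON object per answer):
{"_id": "3f6c1d2a-...", "text": "Use nvidia-smi to check the installed driver version.", "title": ""}

queries.jsonl (one JSON object per question):
{"_id": "9b2e4f7c-...", "text": "How do I check which NVIDIA driver is installed?"}

qrels.tsv (tab-separated; score 1 marks a positive pair, 0 a negative one):
query-id	corpus-id	score
9b2e4f7c-...	3f6c1d2a-...	1
9b2e4f7c-...	7d1a8c3b-...	0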

Results

Although gte-large and e5-large-v2 exhibit similar outcomes when evaluated with NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average Precision), e5-large-v2 demonstrates slightly superior performance, particularly when confronted with unseen data. This suggests that e5-large-v2 is a more effective model for representing this dataset in the context of semantic search and information retrieval tasks. For further insights into NDCG and MAP, refer to these informative articles: NDCG and MAP.
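To compare the models side by side, the per-metric JSON files written by run_evaluator can be loaded straight into pandas. The run directory name below is a placeholder, since it contains a timestamp generated at runtime.

# Load the saved metrics into DataFrames for a side-by-side comparison.
# Replace 'sbert_eval_<timestamp>' with your actual run directory.
import json
import os
import pandas as pd

results_dir = "/path/to/NvidiaQAEval/sbert_eval_<timestamp>/results"

with open(os.path.join(results_dir, "NDCG.json")) as f:
    ndcg_df = pd.DataFrame(json.load(f)).set_index("Model")

with open(os.path.join(results_dir, "MAP.json")) as f:
    map_df = pd.DataFrame(json.load(f)).set_index("Model")

# Each row is a model; the columns are the cutoff metrics reported by BEIR (e.g. NDCG@10).
print(ndcg_df)
print(map_df)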

Conclusion

Selecting the right SBERT model for your dataset is crucial for achieving optimal performance in STS tasks and beyond. The BEIR framework simplifies this process by providing tools for benchmarking and evaluation. By incorporating these tools into a custom SBERT model selection class, you can streamline the process and make informed decisions based on your specific data.

Remember to adapt the class provided here to suit the specific requirements of your dataset and evaluation criteria. Utilizing BEIR’s capabilities will empower you to make informed choices and enhance the semantic understanding of your NLP applications.

Thank you for reading! If you have any recommendations or corrections, please do comment. I would love to learn more :)

You can find the Jupyter Notebook on my GitHub Repository.
