Sparse Vector on ottercoconut's Blog

Practical BM25

Sat, 27 Jun 2026 00:00:00 +0800

References

Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch | Elastic Blog

Practical BM25 - Part 2: The BM25 Algorithm and its Variables | Elastic Blog

Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch | Elastic Blog

Background

In Elasticsearch 5.0, the default similarity algorithm was changed to Okapi BM25, which is used to score the relevance between search results and a query. This post focuses on the practical side of BM25, including its available parameters and the factors that affect scoring.

Understanding How Shards Affect Scoring

Before learning BM25, it is necessary to understand that an Elasticsearch index can be split into multiple shards, which are physical partitions of the index. This matters because BM25 relevance scores are not naturally calculated from global statistics across the entire index. By default, they may be calculated separately inside each shard. The more shards there are, and the less data each shard contains, the easier it is for scoring bias to appear.

Below, we follow the example from the reference article. The goal is to create an Elasticsearch index named people, insert a few test documents, and repeatedly search for the same query term "Shane" to observe how BM25 relevance scores change with document count and shard distribution.

The author creates an index named people, sets it to have 5 primary shards, and uses BM25 as the default similarity algorithm:

PUT people
{
  "settings": {
    "number_of_shards": 5,
    "index": {
      "similarity": {
        "default": {
          "type": "BM25"
        }
      }
    }
  }
}

The author uses his own name as the example:

PUT /people/_doc/1
{
  "title": "Shane"
}

GET /people/_search
{
  "query": {
    "match": {
      "title": "Shane"
    }
  }
}

The search looks for documents whose title field matches "Shane", so it returns /people/_doc/1. Next, three more documents are added:

PUT /people/_doc/2
{
  "title": "Shane C"
}
PUT /people/_doc/3
{
  "title": "Shane Connelly"
}
PUT /people/_doc/4
{
  "title": "Shane P Connelly"
}

Then the same search is run again:

GET /people/_search
{
  "query": {
    "match": {
      "title": "Shane"
    }
  }
}

At this point there are 4 “documents”:

Shane
Shane C
Shane Connelly
Shane P Connelly

The search finds documents whose title field matches "Shane". Although all titles contain "Shane", their BM25 scores are not the same. The result is that doc1 and doc3 both score 0.2876821, while doc2 scores 0.19856805 and doc4 scores 0.16853254.

Although doc2 and doc3 look similar, their scores differ a lot. This is not mainly caused by the difference between "C" and "Connelly", but by how documents are distributed across shards. So how can the scores become more consistent?

The larger the dataset, the smaller the statistical difference between shards.
Reducing the number of shards can reduce scoring bias.
If you want BM25 scores under multiple shards to be closer to “global statistics”, you can add ?search_type=dfs_query_then_fetch when querying. It collects term-frequency statistics from all shards first, then calculates scores in a unified way, so the result will be close to, or even the same as, the result when number_of_shards=1.

The cost of dfs_query_then_fetch is one extra round trip to the shards to gather those statistics before scoring. It is therefore mainly worth using when the dataset is small, there are many shards, the data is unevenly distributed, and precise relevance scores matter.
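The scores quoted above can be reproduced offline. The sketch below is a minimal illustration, not Elasticsearch's implementation. It assumes that doc1 and doc3 each landed alone on a shard while doc2 and doc4 share one (a hypothetical placement, since real routing hashes the document ID), uses the defaults $k_1 = 1.2$ and $b = 0.75$, and applies the BM25 formula from the next section. The helper names are made up for illustration.

```python
import math

K1, B = 1.2, 0.75  # Elasticsearch default BM25 parameters

def bm25_term_score(tf, field_len, avg_field_len, doc_count, doc_freq):
    """BM25 score of one query term in one field, with the (k1 + 1) numerator."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = 1 - B + B * field_len / avg_field_len
    return idf * tf * (K1 + 1) / (tf + K1 * norm)

def score_group(titles, term):
    """Score every title using statistics computed over `titles` only."""
    tokens = {doc_id: t.lower().split() for doc_id, t in titles.items()}
    doc_count = len(tokens)
    doc_freq = sum(term in toks for toks in tokens.values())
    avg_len = sum(len(toks) for toks in tokens.values()) / doc_count
    return {
        doc_id: bm25_term_score(toks.count(term), len(toks), avg_len, doc_count, doc_freq)
        for doc_id, toks in tokens.items()
    }

docs = {1: "Shane", 2: "Shane C", 3: "Shane Connelly", 4: "Shane P Connelly"}

# ASSUMPTION: a shard placement chosen for illustration only.
shards = [{1: docs[1]}, {3: docs[3]}, {2: docs[2], 4: docs[4]}]

# query_then_fetch (default): each shard scores with its own statistics.
local_scores = {}
for shard in shards:
    local_scores.update(score_group(shard, "shane"))

# dfs_query_then_fetch: statistics are pooled across shards before scoring.
global_scores = score_group(docs, "shane")

for doc_id in sorted(docs):
    print(doc_id, round(local_scores[doc_id], 7), round(global_scores[doc_id], 7))
```

Under this assumed placement the shard-local scores come out at roughly 0.2877, 0.1986, 0.2877, and 0.1685 for doc1 to doc4, in line with the figures above, while the pooled statistics give doc2 and doc3 the same score.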

Algorithm and its variables

BM25 model:

$$ \sum_{i}^{n} IDF(q_i) \frac{f(q_i, D) * (k_1 + 1)}{f(q_i, D) + k_1 * (1 - b + b * \frac{fieldLen}{avgFieldLen})} $$

$q_i$: the $i$-th keyword in the query.
$IDF(q_i)$: the inverse document frequency of keyword $q_i$.
$f(q_i, D)$: the term frequency of keyword $q_i$ in document $D$.
$fieldLen$: the length of the current document field.
$avgFieldLen$: the average field length across all documents in the index.
$k_1$ and $b$: tunable parameters. Usually $k_1 \in [1.2, 2.0]$, and $b = 0.75$.

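In simple terms, BM25 is a TF-IDF model that introduces nonlinearity and handles the frequency saturation problem. The TF-IDF model, where $N$ is the total number of documents and $n(q_i)$ is the number of documents containing $q_i$, is: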

$$ \text{Score} = f(q_i, D) \times \log\left(\frac{N}{n(q_i)}\right) $$

$q_i$

For example, if I search for “shane”, there is only one query term, so $q_0$ is “shane”. If I search for “shane connelly” in English, Elasticsearch recognizes the space and tokenizes the query into two terms: $q_0$ is “shane”, and $q_1$ is “connelly”. These query terms are substituted into the other parts of the formula, and the final results are summed.

$IDF(q_i)$

The IDF (Inverse Document Frequency) part of the formula measures how frequently a term appears across all documents. It “penalizes” common terms by lowering their weight. In the Lucene/BM25 algorithm, the actual formula is:

$$ \ln \left( 1 + \frac{(docCount - f(q_i) + 0.5)}{f(q_i) + 0.5} \right) $$

Here, $docCount$ is the total number of documents in this shard that contain a value for this field. If the search_type=dfs_query_then_fetch parameter is used, it is the count across all shards. $f(q_i)$ is the number of documents containing the $i$-th query term. In the example, the term “shane” appears in all 4 documents, so the inverse document frequency $IDF(\text{"shane"})$ is:

$$ \ln\left(1 + \frac{(4 - 4 + 0.5)}{4 + 0.5}\right) = \ln\left(1 + \frac{0.5}{4.5}\right) = 0.105360515657826 $$

$IDF(\text{"connelly"})$ is:

$$ \ln\left(1 + \frac{(4 - 2 + 0.5)}{2 + 0.5}\right) = \ln\left(1 + \frac{2.5}{2.5}\right) = 0.693147180559945 $$

We can see that queries containing rarer terms have a higher multiplier. In this 4-document corpus, “connelly” is rarer than “shane”, so it contributes more to the final score. This matches intuition: the word “the” may appear in almost every English document, so when a user searches for something like “the elephant”, “elephant” is clearly more important than “the”, and we also expect it to contribute more to the search score.
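As a worked example, assume all four documents sit in a single shard, so $docCount = 4$ and $avgFieldLen = (1 + 2 + 2 + 3) / 4 = 2$. For the query “shane connelly” and document 3 (“Shane Connelly”), $fieldLen = avgFieldLen$, so the length term equals 1 and each term-frequency factor reduces to $\frac{1 \cdot (k_1 + 1)}{1 + k_1} = 1$ regardless of $k_1$. Only the two IDF values remain:

$$ score(D_3) = IDF(\text{"shane"}) \cdot 1 + IDF(\text{"connelly"}) \cdot 1 \approx 0.1054 + 0.6931 = 0.7985 $$

Under the same single-shard assumption, about 87% of that score comes from the rarer term.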

$fieldLen/avgFieldLen$

The more terms a document contains, at least terms that do not match the query, the lower the document score tends to be. This also matches intuition: if a 300-page document mentions my name only once, it is probably less relevant than a short tweet that also mentions my name once.

$b$

The larger the value of $b$, the more the document length ratio affects the score. To see this, imagine setting $b$ to 0: the length term $(1 - b + b \cdot \frac{fieldLen}{avgFieldLen})$ becomes 1, so document length has no effect and the score depends only on term frequency and IDF. If $b$ is set to 1, the length term becomes the full ratio $\frac{fieldLen}{avgFieldLen}$, so length normalization is applied at full strength. Term frequency still contributes in that case.
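A small numeric sketch makes the role of $b$ concrete. The field lengths below are hypothetical, and the function covers only the term-frequency part of the formula (IDF is left out):

```python
def length_norm(field_len, avg_field_len, b):
    """The factor that multiplies k1 in the BM25 denominator."""
    return 1 - b + b * field_len / avg_field_len

def tf_factor(tf, field_len, avg_field_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for a single term."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(field_len, avg_field_len, b))

# Hypothetical fields: one half the average length, one three times the average.
for b in (0.0, 0.75, 1.0):
    short = tf_factor(tf=1, field_len=50, avg_field_len=100, b=b)
    long_ = tf_factor(tf=1, field_len=300, avg_field_len=100, b=b)
    print(f"b={b}: short={short:.3f} long={long_:.3f}")
```

With $b = 0$ the short and long fields get the same factor. As $b$ grows toward 1 the gap between them widens, yet a higher term frequency still raises the factor at any value of $b$.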

$f(q_i, D)$

This value corresponds to TF, or Term Frequency.

$f(q_i, D)$ means: how many times does the $i$-th query term appear in document $D$? In all of the example documents, $f(\text{"shane"}, D)$ is 1, but $f(\text{"connelly"}, D)$ differs: it is 1 in documents 3 and 4, and 0 in documents 1 and 2. If there were a 5th document whose text was “shane shane”, then $f(\text{"shane"}, D)$ would be 2. We can see that $f(q_i, D)$ appears in both the numerator and denominator, together with a special factor called “$k_1$”, which is discussed below. The basic intuition is that the more often a query term appears in a document, the higher the score becomes. A document that mentions our name multiple times is more likely to be relevant than one that mentions it only once.

$k_1$

In BM25, $k_1$ is the core parameter controlling term frequency saturation. It sets an asymptotic upper bound for the contribution of $f(q_i, D)$ to the relevance score, making the marginal gain decrease nonlinearly as term frequency increases. Compared with the almost linear weight growth in traditional TF-IDF, this mechanism effectively suppresses excessive ranking influence from high-frequency terms, such as keyword stuffing. The value of $k_1$ directly determines how quickly the score approaches saturation: a smaller $k_1$ makes term frequency contribution hit the bottleneck quickly, while a larger $k_1$ allows term frequency to maintain meaningful weight gains over a wider range.

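If $k_1$ is set to 0, the term-frequency factor is fixed at 1 for any term that appears at least once, so only IDF distinguishes matches. If $k_1$ is set to a very large value, such as 10000, the factor approximately degenerates into $\frac{TF \times k_1}{k_1} = TF$ for a field of average length, becoming term frequency itself.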
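The saturation behaviour is easy to tabulate. This sketch assumes a field of average length (so the length term is 1) and looks only at the term-frequency factor; the function name is illustrative:

```python
def tf_factor(tf, k1, norm=1.0):
    """BM25 term-frequency factor; norm=1.0 means a field of average length."""
    return tf * (k1 + 1) / (tf + k1 * norm)

for k1 in (0.0, 1.2, 2.0, 10000.0):
    row = [round(tf_factor(tf, k1), 3) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: {row}")
```

For a finite $k_1$ the factor can never exceed $k_1 + 1$, which is the asymptote it approaches as term frequency grows.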

Picking $b$ and $k_1$

Regarding the values of $b$ and $k_1$, the Elasticsearch article also points out that the current defaults are empirical values that work for most cases, but there is no globally optimal b and k1. They must be evaluated together with the corpus and queries.

Also, when retrieval performance is not good enough, the following should be optimized before tuning $b$ and $k_1$:

Boost exact phrase matches.
Use synonyms to expand expressions that users may care about.
Use analysis components such as fuzziness, typeahead, phonetic matching, and stemming to handle spelling mistakes, language differences, and word-form variations.
Use function score to adjust document scores based on publish time, geographical distance, or business features.

As for the Explain API in the later part of the Elasticsearch article, I will not expand on it here.

Sparse Vectors and the SPLADE Model

Sat, 27 Jun 2026 00:00:00 +0800

References

Sparse Vectors and the SPLADE Model

In RAG systems, dense vectors have become the most common retrieval method. They map text into a continuous vector space and are good at capturing semantically similar expressions, such as “employee resignation process” and “personnel exit procedure”. However, dense vectors also have clear weaknesses: they are not always good at exact matching for entities, IDs, terminology, error codes, product models, table field names, and code snippets.

This is where sparse vectors become valuable. They are more like a neural-network-enhanced inverted index: text is still represented as sparse weights over term dimensions, but these weights are not calculated by pure statistical methods like BM25. Instead, they are predicted by a model.

In short:

Dense vectors handle semantic similarity.
BM25 handles exact lexical matching.
SPLADE sparse vectors handle weighted matching after neural term expansion.
Hybrid Search merges dense and sparse retrieval results.

Paper model

SPLADE maps a piece of text into vocabulary space based on the logits of a Masked Language Model. Suppose the vocabulary contains 30522 WordPiece tokens. Each text can eventually be represented as:

token_id -> weight

This is a sparse vector. Most token weights are 0, and only a small number of tokens that the model considers important have non-zero weights.

The biggest difference from ordinary embeddings is that each dimension in a dense embedding is usually not interpretable, while each dimension in a sparse vector is a vocabulary token. A token activated by the model can be understood as “this text is related to this term”.

For example, a document may not explicitly contain the word “reimbursement”, but it may contain “travel expense”, “invoice”, and “approval form”. SPLADE may activate tokens related to “reimbursement”. Then, when the query is “reimbursement process”, the document may still be retrieved even if it does not exactly match the original term.

More specifically, SPLADE uses the logits from the Masked Language Model layer to predict the importance of each term in the BERT WordPiece vocabulary. Suppose the tokenized input text is:

$$ t=(t_{1},t_{2},...,t_{N}) $$

and the corresponding contextual representations are:

$$ (h_{1},h_{2},...,h_{N}) $$

For the $i$-th token in the input, the model calculates its importance for the $j$-th token in the vocabulary:

$$ w_{ij}=transform(h_{i})^{T}E_{j}+b_{j}, \quad j\in\{1,...,|V|\} $$

Here, $E_j$ is the BERT input embedding of vocabulary token $j$, and $b_j$ is the token-level bias. $transform(\cdot)$ is usually a linear transformation with GeLU and LayerNorm. Intuitively, this step asks: for this position in the input, how related is it to each term in the vocabulary?

However, retrieval does not need “the score of a term at one position”. It needs “the score of a term for the whole text”. Therefore, SPLADE aggregates activations from different positions into a sparse representation for the whole text:

$$ w_{j}=\sum_{i\in t}\log(1+ReLU(w_{ij})) $$

There are three meanings in this formula:

ReLU sets negative scores to zero and keeps only positively related terms.
$\log(1+x)$ performs logarithmic saturation, preventing scores of frequent or repeated words from growing without bound.
$\sum$ accumulates activations from different positions for the same vocabulary token, producing the term weight for the whole text.
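The three steps can be sketched in plain Python. The logits below are toy numbers over a hypothetical 4-token vocabulary, not output from a real model:

```python
import math

def splade_sum_pool(logits):
    """Aggregate per-position logits w_ij into per-token weights w_j.

    logits[i][j] is the score of vocabulary token j at input position i.
    """
    vocab_size = len(logits[0])
    weights = [0.0] * vocab_size
    for row in logits:
        for j, w_ij in enumerate(row):
            # log(1 + ReLU(w_ij)), accumulated over positions
            weights[j] += math.log1p(max(w_ij, 0.0))
    return weights

# Toy logits: 3 input positions, 4 vocabulary tokens.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [1.0, -0.5, 0.0, -2.0],
    [0.0, -2.0, 0.0, 1.0],
]
weights = splade_sum_pool(logits)

# Keep only non-zero entries to get the sparse token_id -> weight form.
sparse = {j: round(w, 4) for j, w in enumerate(weights) if w > 0}
print(sparse)
```

Tokens whose logits are never positive drop out entirely, which is where the sparsity of the final representation comes from.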

Finally, the text becomes a high-dimensional but sparse vector:

token_id -> weight

After both the query and document are mapped into the same vocabulary space, the retrieval score is the dot product of sparse vectors:

$$ s(q,d)=\sum_j w_j^q w_j^d $$

This is also why SPLADE can be connected to inverted indexes or sparse vector indexes.
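To illustrate that connection, here is a minimal sketch of dot-product scoring over an inverted index. The token ids, weights, and function names are hypothetical, and a real engine would add compression and pruning on top of this:

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map token_id -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for token_id, weight in vec.items():
            index[token_id].append((doc_id, weight))
    return index

def search(index, query_vec):
    """Accumulate s(q, d) = sum_j w_j^q * w_j^d using only the query's posting lists."""
    scores = defaultdict(float)
    for token_id, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(token_id, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sparse vectors, not output from a real model.
doc_vectors = {
    "doc_a": {101: 1.2, 205: 0.4, 990: 0.9},
    "doc_b": {205: 1.5, 311: 0.7},
}
query_vec = {205: 1.0, 990: 0.5}

index = build_inverted_index(doc_vectors)
results = search(index, query_vec)
print(results)
```

Only the posting lists of tokens with non-zero query weight are visited, so query cost grows with the number of active query tokens and the length of their posting lists.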

Ranking loss

During training, SPLADE needs to make relevant documents score higher and irrelevant documents score lower. Given a query $q_i$, a positive document $d_i^+$, a hard negative document $d_i^-$, and a group of in-batch negative documents $\{d_{i,j}^{-}\}_j$, a contrastive ranking loss similar to the following can be used:

$$ \mathcal{L}_{rank-IBN} = -\log \frac{e^{s(q_i,d_i^+)}} {e^{s(q_i,d_i^+)} + e^{s(q_i,d_i^-)} + \sum_j e^{s(q_i,d_{i,j}^{-})}} $$

Its goal is direct: make the probability of the positive document as large as possible within the candidate set. From an engineering perspective, the model keeps learning which term expansions help it rank the correct document higher.
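For a single query, the loss can be computed as follows. This is a plain-Python sketch with made-up scores; an actual training loop would use a tensor library and batch the computation:

```python
import math

def rank_ibn_loss(pos_score, hard_neg_score, in_batch_neg_scores):
    """Negative log-softmax of the positive score over positive plus negatives."""
    scores = [pos_score, hard_neg_score, *in_batch_neg_scores]
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

# Hypothetical similarity scores s(q, d).
well_ranked = rank_ibn_loss(12.0, 4.0, [3.0, 1.5])   # positive clearly on top
badly_ranked = rank_ibn_loss(4.0, 12.0, [3.0, 1.5])  # hard negative on top
print(well_ranked, badly_ranked)
```

The loss is close to zero when the positive document dominates the candidate set and grows as a negative overtakes it.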

FLOPS sparsity regularization

If only ranking quality is optimized, the model may activate too many tokens. This may improve recall, but the inverted index becomes larger, and queries need to access more posting lists.

Therefore, SPLADE introduces FLOPS regularization to control sparsity. For a batch of documents, first estimate the average activation of vocabulary token $j$ in this batch:

$$ \overline{a}_{j}=\frac{1}{N}\sum_{i=1}^{N}w_{j}^{(d_i)} $$

Then square and sum the average activations:

$$ l_{FLOPS}=\sum_{j\in V}\overline{a}_{j}^{2} =\sum_{j\in V}(\frac{1}{N}\sum_{i=1}^{N}w_{j}^{(d_i)})^{2} $$

This regularization term is not simply controlling “vector dimensionality”. It controls the number and distribution of non-zero tokens. It tries to prevent the model from binding many documents to a few high-frequency words, and also prevents every document from activating too many terms.
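A toy computation shows why the squared mean penalizes many documents piling onto the same token. The batches below are hypothetical and use a 4-token vocabulary:

```python
def flops_loss(batch_weights):
    """Sum over vocabulary tokens of the squared mean activation in the batch.

    batch_weights[i][j] is w_j for the i-th document in the batch.
    """
    n = len(batch_weights)
    vocab_size = len(batch_weights[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(doc[j] for doc in batch_weights) / n
        total += mean_activation ** 2
    return total

# Same weight mass per document in both batches.
# `shared`: both documents activate the same token.
# `spread`: each document activates a different token.
shared = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
spread = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
print(flops_loss(shared), flops_loss(spread))
```

The `shared` batch is penalized more than the `spread` one, which pushes the model toward posting lists of more even length.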

Therefore, the sparsity weight can be understood as a knob between recall quality and retrieval cost:

Larger weight: shorter sparse vectors, smaller index, faster retrieval, but possibly lower recall.
Smaller weight: longer sparse vectors and richer expansion, but higher index and retrieval cost.

Overall loss

Finally, SPLADE trains ranking loss and sparsity regularization together:

$$ \mathcal{L}=\mathcal{L}_{rank-IBN} +\lambda_q\mathcal{L}_{reg}^{q} +\lambda_d\mathcal{L}_{reg}^{d} $$

Here, $\lambda_q$ controls query-side sparsity, and $\lambda_d$ controls document-side sparsity. Query-side sparsity is usually very important because queries are more sensitive to latency. Document-side vectors can be computed offline, so slightly higher compute cost is often acceptable, but index size still needs to be controlled.

From sum pooling to max pooling

The original SPLADE aggregates term predictions from every input position:

$$ w_{j}=\sum_{i\in t}\log(1+ReLU(w_{ij})) $$

SPLADE-max, the later and now more common variant, uses max pooling instead:

$$ w_{j}=\max_{i\in t}\log(1+ReLU(w_{ij})) $$

This does not mean the whole text only keeps one token. Instead, it takes the maximum activation separately for each vocabulary dimension. This can reduce amplification from long text or repeated words, making the representation focus more on whether a semantic term is strongly activated, rather than simply depending on occurrence count.
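The difference shows up when one token is weakly activated at many positions. In this sketch the logits are hypothetical: token 0 fires at every position (as a repeated word might), while token 1 fires strongly once:

```python
import math

def pool(logits, mode):
    """Pool log(1 + ReLU(w_ij)) over positions i, separately per vocabulary token j."""
    reduce_fn = sum if mode == "sum" else max
    columns = zip(*logits)  # one column per vocabulary dimension
    return [reduce_fn(math.log1p(max(w, 0.0)) for w in col) for col in columns]

logits = [
    [1.0, 0.0],
    [1.0, 4.0],
    [1.0, 0.0],
    [1.0, 0.0],
]
sum_weights = pool(logits, "sum")
max_weights = pool(logits, "max")
print(sum_weights, max_weights)
```

Sum pooling ranks the repeated token above the strongly activated one, while max pooling reverses that order.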

SPLADE-doc and distillation training

Standard SPLADE encodes both query and document. In other words, both query-side and document-side representations may produce neural expansion terms. Retrieval calculates:

$$ s(q,d)=\sum_j w_j^q w_j^d $$

SPLADE-doc is more focused on engineering efficiency. It only applies SPLADE encoding on the document side, while the query side usually uses only the original query tokens. The document score can be written as:

$$ s(q,d)=\sum_{j\in q}w_j^d $$

This means document-side expansion can be precomputed offline, and the query side does not need to run a SPLADE encoder, reducing latency. The tradeoff is that the query side has no neural expansion ability and can only use “document-side expansion”.
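A minimal sketch of the SPLADE-doc score, assuming the query is reduced to its set of distinct token ids (an assumption for illustration). The ids and weights are made up:

```python
def splade_doc_score(query_token_ids, doc_vec):
    """s(q, d): sum of document weights over the distinct tokens present in the query."""
    return sum(doc_vec.get(token_id, 0.0) for token_id in set(query_token_ids))

# Hypothetical token ids: 7 = "reimbursement", 8 = "process", 9 = "invoice".
# Assume the document encoder expanded the text to token 7 offline.
doc_vec = {7: 0.8, 9: 1.1}

# The query side is just tokenizer output; no encoder forward pass is needed.
query_token_ids = [7, 8]

score = splade_doc_score(query_token_ids, doc_vec)
print(score)
```

The query contributes no weights of its own here, so every matching token counts with the weight the document encoder assigned to it.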

In addition, many strong SPLADE models use knowledge distillation and hard negatives. A common approach is to first train a first-stage retriever and a cross-encoder reranker, then continue training with harder negatives and reranker scores. In engineering practice, we do not have to reproduce this whole training pipeline to use public models. But understanding it helps explain why words like distil, ensemble, and cocondenser appear in model names.

Why sparsity matters

If the model activates many tokens, recall may improve, but the index becomes larger and retrieval becomes slower. SPLADE uses FLOPS regularization to control the number and distribution of non-zero tokens.

From an engineering perspective, sparse vectors are not better just because they are longer.

Too few non-zero tokens: the index is small and retrieval is fast, but recall may be insufficient.
Too many non-zero tokens: recall may be better, but the index expands and retrieval cost increases.

In practice, secondary pruning is often applied, such as:

Keeping only the top_k tokens.
Filtering tokens whose weight is below a threshold.
Limiting the maximum number of sparse dimensions for a single chunk.

These parameters often affect online cost more than the model itself.
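The first two pruning rules can be combined in a few lines. This is a sketch with hypothetical parameter names (`top_k`, `min_weight`) and made-up weights:

```python
def prune(sparse_vec, top_k=None, min_weight=0.0):
    """Drop tokens below min_weight, then keep at most top_k of the heaviest ones."""
    kept = [(t, w) for t, w in sparse_vec.items() if w >= min_weight]
    kept.sort(key=lambda tw: tw[1], reverse=True)
    if top_k is not None:
        kept = kept[:top_k]
    return dict(kept)

vec = {11: 2.1, 42: 0.05, 73: 0.9, 88: 0.3, 95: 1.4}
pruned = prune(vec, top_k=3, min_weight=0.1)
print(pruned)
```

Applying the same `top_k` to every chunk also caps the number of sparse dimensions per chunk, which covers the third rule.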

Model selection

SPLADE is more like a family of sparse neural retrieval methods than a single model. The official Naver repository also notes that different regularization strengths produce models ranging from “very sparse” to “strong query/doc expansion”. Their effectiveness, index size, and latency all differ.

If the goal is only to quickly validate engineering feasibility, naver/splade-cocondenser-ensembledistil is a good starting point. It is a common strong model in the official SPLADE++ series. The Naver repository reports its MS MARCO dev MRR@10 as 38.3, higher than splade_v2_max at 34.0 and splade_v2_distil at 36.8. It is suitable for first checking whether sparse retrieval can fill the keyword, entity, and terminology recall gaps of dense retrieval.

If inference cost matters more, consider naver/splade_v2_max or the efficient SPLADE series. splade_v2_max is structurally simple. Its Hugging Face model page marks it as DistilBERT base, with a 512-token maximum length, 30522-dimensional output, and dot-product similarity. The efficient SPLADE series further separates document encoder and query encoder, aiming to reduce query-side latency.

A practical selection order is:

First choose a strong public model for offline evaluation, such as naver/splade-cocondenser-ensembledistil.
If offline evaluation is effective, then measure average non-zero token count, index size, document-side encoding throughput, and query-side P95 latency.
If query-side latency is too high, first try query caching, ONNX/OpenVINO, quantization, or efficient SPLADE.
If the index is too large, first reduce top-k, increase the minimum weight threshold, or choose a model with stronger regularization and higher sparsity.
If business data differs greatly from public English retrieval datasets, consider fine-tuning with domain data instead of directly trusting public leaderboards.

Do not choose a model only by MRR. SPLADE model selection should consider at least five things at the same time: retrieval quality, average non-zero dimensions, index size, query latency, and deployment complexity.

Sentence Transformers now provides SparseEncoder, which can directly load SPLADE models:

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
embeddings = model.encode(["example query"])

It also provides encode_query(), encode_document(), sparsity statistics, Qdrant/Elasticsearch/OpenSearch integration, and deployment capabilities related to ONNX/OpenVINO/quantization. For engineering prototypes, this route can be used first, and then the implementation can be moved to a custom inference service depending on performance bottlenecks.

Differences between SPLADE and BM25

BM25 and SPLADE can both use inverted indexes for retrieval, but their weights come from different sources.

BM25 weights come from statistics, such as TF, IDF, and document length normalization. It mainly depends on exact matching between query terms and document terms.

SPLADE weights come from neural model predictions. It can not only preserve tokens that appear in the original text, but may also activate semantically related tokens that do not appear in the original text.

So it can be roughly understood as:

BM25 = statistical matching of original terms
SPLADE = weighted matching of neural expansion terms

In enterprise knowledge bases, technical documentation, customer-service FAQs, code documentation, policies, and regulations, both BM25 and SPLADE are valuable. BM25 is lighter, while SPLADE is stronger but more expensive.