
enhancement: improve Sentence Transformers Embedding Backend Factory #9804

@anakin87

Description


Background

The internal class _SentenceTransformersEmbeddingBackendFactory ensures that when both a text embedder and a document embedder use the same model, the model is loaded only once. (This also works for two text embedders, etc.)

This is done by computing an embedding_backend_id. When a new embedder is warmed up, we check if the ID already exists → reuse the model; otherwise, load a new one and store it.
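For reference, here is a minimal, self-contained sketch of how such a factory-level cache behaves (class and attribute names are simplified for illustration, not the exact Haystack internals):

from sentence_transformers import SentenceTransformer

class _FactorySketch:
    # one shared dict of already-loaded backends, keyed by the embedding_backend_id
    _instances: dict = {}

    @classmethod
    def get_embedding_backend(cls, model, device=None, auth_token=None, truncate_dim=None, backend="torch"):
        embedding_backend_id = f"{model}{device}{auth_token}{truncate_dim}{backend}"
        if embedding_backend_id in cls._instances:
            # same ID -> the already-loaded model is reused
            return cls._instances[embedding_backend_id]
        # auth_token and other loading options are also forwarded in the real backend; omitted here for brevity
        instance = SentenceTransformer(model, device=device, truncate_dim=truncate_dim, backend=backend)
        cls._instances[embedding_backend_id] = instance
        return instance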

Issue

This is the current implementation of embedding_backend_id:

embedding_backend_id = f"{model}{device}{auth_token}{truncate_dim}{backend}"

It does not account for all parameters that affect how the model is actually loaded in Sentence Transformers.

For example, model_kwargs and tokenizer_kwargs change the configuration of the underlying model, but are not reflected in the ID. This can cause different embedders to incorrectly share the same backend:

from haystack.components.embedders import SentenceTransformersDocumentEmbedder

emb1 = SentenceTransformersDocumentEmbedder(model="mymodel")
emb2 = SentenceTransformersDocumentEmbedder(model="mymodel", tokenizer_kwargs={"model_max_length": 512})

emb1.warm_up()
emb2.warm_up()

print(emb1.embedding_backend is emb2.embedding_backend)
# True !!!

This could be considered a bug, although in practice it may not affect many users.

Potential improvements

In the _SentenceTransformersSparseEmbeddingBackendFactory, I adopted a different strategy: the ID includes all parameters accepted by Sentence Transformers at model loading.

embedding_backend_id = json.dumps(cache_params, sort_keys=True, default=str)

I think this is a better approach and should also be adopted in _SentenceTransformersEmbeddingBackendFactory.
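As a rough illustration of that strategy (the helper below is hypothetical; the actual sparse factory collects the parameters it passes to Sentence Transformers into cache_params):

import json

def make_backend_id(**cache_params) -> str:
    # every parameter that influences model loading goes into the ID;
    # sort_keys makes the ID independent of argument order,
    # default=str handles values that are not JSON-serializable (e.g. devices or secrets)
    return json.dumps(cache_params, sort_keys=True, default=str)

id1 = make_backend_id(model="mymodel")
id2 = make_backend_id(model="mymodel", tokenizer_kwargs={"model_max_length": 512})
print(id1 == id2)
# False -> these two embedders would no longer share a backend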

Minor additional improvement

I also suggest modifying _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend and _SentenceTransformersSparseEncoderEmbeddingBackend.__init__ to accept keyword-only arguments, as sketched below. Since these are internal classes, we can safely make this change.
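For instance, the constructor could look something like this (a hypothetical sketch with an abbreviated parameter list):

from typing import Any, Optional

class _SentenceTransformersSparseEncoderEmbeddingBackend:
    def __init__(self, *, model: str, device: Optional[str] = None, **load_kwargs: Any) -> None:
        # the leading * makes every parameter keyword-only, so internal call sites
        # must name each argument and do not break if parameters are added or reordered
        self.model = model
        self.device = device
        self.load_kwargs = load_kwargs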
