
enhancement: improve Sentence Transformers Embedding Backend Factory #9804

@anakin87

Description


Background

The internal class _SentenceTransformersEmbeddingBackendFactory ensures that when both a text embedder and a document embedder use the same model, the model is loaded only once. (This also works for two text embedders, etc.)

This is done by computing an embedding_backend_id. When a new embedder is warmed up, we check if the ID already exists → reuse the model; otherwise, load a new one and store it.
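For reference, here is a minimal, self-contained sketch of how such a factory-level cache behaves (class and attribute names are simplified for illustration, not the exact Haystack internals):

from sentence_transformers import SentenceTransformer

class _FactorySketch:
    # one shared dict of already-loaded backends, keyed by the embedding_backend_id
    _instances: dict = {}

    @classmethod
    def get_embedding_backend(cls, model, device=None, auth_token=None, truncate_dim=None, backend="torch"):
        embedding_backend_id = f"{model}{device}{auth_token}{truncate_dim}{backend}"
        if embedding_backend_id in cls._instances:
            # same ID -> the already-loaded model is reused
            return cls._instances[embedding_backend_id]
        # auth_token and other loading options are also forwarded in the real backend; omitted here for brevity
        instance = SentenceTransformer(model, device=device, truncate_dim=truncate_dim, backend=backend)
        cls._instances[embedding_backend_id] = instance
        return instance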

Issue

This is the current implementation of embedding_backend_id:

embedding_backend_id = f"{model}{device}{auth_token}{truncate_dim}{backend}"

It does not account for all parameters that affect how the model is actually loaded in Sentence Transformers.

For example, model_kwargs and tokenizer_kwargs change the configuration of the underlying model, but are not reflected in the ID. This can cause different embedders to incorrectly share the same backend:

from haystack.components.embedders import SentenceTransformersDocumentEmbedder

emb1 = SentenceTransformersDocumentEmbedder(model="mymodel")
emb2 = SentenceTransformersDocumentEmbedder(model="mymodel", tokenizer_kwargs={"model_max_length": 512})

emb1.warm_up()
emb2.warm_up()

print(emb1.embedding_backend is emb2.embedding_backend)
# True !!!

This could be considered a bug, although in practice it may not affect many users.

Potential improvements

In the _SentenceTransformersSparseEmbeddingBackendFactory, I adopted a different strategy: the ID includes all parameters accepted by Sentence Transformers at model loading.

embedding_backend_id = json.dumps(cache_params, sort_keys=True, default=str)

I think this is a better approach and should also be adopted in _SentenceTransformersEmbeddingBackendFactory.
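As a rough illustration of that strategy (the helper below is hypothetical; the actual sparse factory collects the parameters it passes to Sentence Transformers into cache_params):

import json

def make_backend_id(**cache_params) -> str:
    # every parameter that influences model loading goes into the ID;
    # sort_keys makes the ID independent of argument order,
    # default=str handles values that are not JSON-serializable (e.g. devices or secrets)
    return json.dumps(cache_params, sort_keys=True, default=str)

id1 = make_backend_id(model="mymodel")
id2 = make_backend_id(model="mymodel", tokenizer_kwargs={"model_max_length": 512})
print(id1 == id2)
# False -> these two embedders would no longer share a backend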

Minor additional improvement

I also suggest modifying _SentenceTransformersSparseEmbeddingBackendFactory.get_embedding_backend and _SentenceTransformersSparseEncoderEmbeddingBackend.__init__ to accept keyword-only arguments, as sketched below. Since these are internal classes, we can safely make this change.
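For instance, the constructor could look something like this (a hypothetical sketch with an abbreviated parameter list):

from typing import Any, Optional

class _SentenceTransformersSparseEncoderEmbeddingBackend:
    def __init__(self, *, model: str, device: Optional[str] = None, **load_kwargs: Any) -> None:
        # the leading * makes every parameter keyword-only, so internal call sites
        # must name each argument and do not break if parameters are added or reordered
        self.model = model
        self.device = device
        self.load_kwargs = load_kwargs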
