Commit 1cba7e3

ahmedlone127, DevinTDHa, agsfer, and maziyarpanahi authored
Add Pooling Average to Broken XXXForSentenceEmbedding annotators (#14328)
* SPARKNLP-1036: Onnx Example notebooks (#14234)
* SPARKNLP-1036: Fix dev python kernel names
* SPARKNLP-1036: Bump transformers version
* SPARKNLP-1036: Fix Colab buttons
* SPARKNLP-1036: Pin onnx version for compatibility
* SPARKNLP-1036: Upgrade Spark version
* SPARKNLP-1036: Minor Fixes
* SPARKNLP-1036: Clean Metadata
* SPARKNLP-1036: Add/Adjust Documentation
  - Note for supported Spark Version of Annotators
  - added missing Documentation for BGEEmbeddings
* Fixies (#14307)
* adding fix for broken annotators

---------

Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Lev <agsfer@gmail.com>
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
1 parent 85c90dd commit 1cba7e3
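Here, "Pooling Average" refers to mean pooling: the per-token embeddings produced by a transformer are averaged into a single fixed-size sentence vector. A minimal plain-Scala sketch of the idea follows; the array shapes and the attention-mask handling are illustrative assumptions, not the annotators' actual implementation.

```scala
// Minimal mean-pooling sketch: average the token embeddings of one sentence into a
// single fixed-size sentence vector, skipping padding tokens via the attention mask.
// Shapes and mask handling are illustrative assumptions, not Spark NLP internals.
def meanPool(tokenEmbeddings: Array[Array[Float]], attentionMask: Array[Int]): Array[Float] = {
  require(tokenEmbeddings.length == attentionMask.length, "one mask entry per token")
  val dim = tokenEmbeddings.head.length
  val sum = new Array[Float](dim)
  var kept = 0
  for ((vec, maskBit) <- tokenEmbeddings.zip(attentionMask) if maskBit == 1) {
    var i = 0
    while (i < dim) { sum(i) += vec(i); i += 1 }
    kept += 1
  }
  // Divide by the number of real (non-padding) tokens to get the average.
  sum.map(_ / math.max(kept, 1))
}

val tokenVectors = Array(Array(1.0f, 2.0f), Array(3.0f, 4.0f), Array(0.0f, 0.0f))
val mask = Array(1, 1, 0) // last position is padding
println(meanPool(tokenVectors, mask).mkString(", ")) // 2.0, 3.0
```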

File tree

57 files changed (+30397 / -31852 lines)

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
{%- capture title -%}
BGEEmbeddings
{%- endcapture -%}

{%- capture description -%}
Sentence embeddings using BGE.

BGE, or BAAI General Embeddings, is a model that can map any text to a low-dimensional dense
vector which can be used for tasks like retrieval, classification, clustering, or semantic
search.

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val embeddings = BGEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
```

The default model is `"bge_base"` if no name is provided.

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?q=BGE).

For extended examples of usage, see
[BGEEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddingsTestSpec.scala).

**Sources**:

[C-Pack: Packaged Resources To Advance General Chinese Embedding](https://arxiv.org/pdf/2309.07597)

[BGE Github Repository](https://github.com/FlagOpen/FlagEmbedding)

**Paper abstract**

*We introduce C-Pack, a package of resources that significantly advance the field of general
Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive
benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive
text embedding dataset curated from labeled and unlabeled Chinese corpora for training
embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models
outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the
release. We also integrate and optimize the entire suite of training methods for C-TEM. Along
with our resources on general Chinese embedding, we release our data and models for English
text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark;
meanwhile, our released English data is 2 times larger than the Chinese data. All these
resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.*
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("bge_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["bge_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])
data = spark.createDataFrame([["query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " + \
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " + \
    "marathon. Check out the chart below to see how much protein you should be eating each day.",
]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document")
  .setOutputCol("bge_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("bge_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq("query: how much protein should a female eat",
  "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
    "marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")

val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[BGEEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[BGEEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/bge_embeddings/index.html#sparknlp.annotator.embeddings.bge_embeddings.BGEEmbeddings)
{%- endcapture -%}

{%- capture source_link -%}
[BGEEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
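Since the entry above lists retrieval and semantic search among the target tasks, the sketch below shows one way the finished sentence vectors could be compared with cosine similarity; it is a plain-Scala illustration, and the helper and variable names are assumptions rather than part of this commit.

```scala
// Cosine similarity between two sentence embeddings, e.g. a "query: ..." vector
// and a "passage: ..." vector produced by BGEEmbeddings + EmbeddingsFinisher.
// Illustrative sketch only; not part of this commit.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "embeddings must have the same dimension")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Hypothetical usage: score a passage vector against a query vector.
val queryVec   = Array(0.1, 0.3, 0.5)
val passageVec = Array(0.2, 0.1, 0.4)
println(cosineSimilarity(queryVec, passageVec))
```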

docs/en/annotators.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ There are two types of Annotators:
 {:.table-model-big}
 |Annotator|Description|Version |
 |---|---|---|
+{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
 {% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
 {% include templates/anno_table_entry.md path="" name="Chunk2Doc" summary="Converts a `CHUNK` type column back into `DOCUMENT`. Useful when trying to re-tokenize or do further analysis on a `CHUNK` result."%}
 {% include templates/anno_table_entry.md path="" name="ChunkEmbeddings" summary="This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs."%}

docs/en/auxiliary.md

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ import com.johnsnowlabs.nlp.Annotation
 **Examples:**

 Complete usage examples can be seen here:
-https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb
+[https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/234-release-candidate/jupyter/annotation/english/spark-nlp-basics/spark-nlp-basics-functions.ipynb)

 <div class="tabs-box tabs-new" markdown="1">

docs/en/install.md

Lines changed: 1 addition & 1 deletion
@@ -760,7 +760,7 @@ Finally, use **jupyter_notebook_config.json** for the password:

 In order to fully take advantage of Spark NLP on Windows (8 or 10), you need to setup/install Apache Spark, Apache Hadoop, Java and a Pyton environment correctly by following the following instructions: [https://github.com/JohnSnowLabs/spark-nlp/discussions/1022](https://github.com/JohnSnowLabs/spark-nlp/discussions/1022)

-</div><div class="h3-box" markdown="1">\
+</div><div class="h3-box" markdown="1">

 ### How to correctly install Spark NLP on Windows

docs/en/transformer_entries/E5Embeddings.md

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ Sentence embeddings using E5.
 E5, an instruction-finetuned text embedding model that can generate text embeddings tailored
 to any task (e.g., classification, retrieval, clustering, text evaluation, etc.)

+Note that this annotator is only supported for Spark Versions 3.4 and up.
+
 Pretrained models can be loaded with `pretrained` of the companion object:

 ```scala

docs/en/transformer_entries/MPNetEmbeddings.md

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. MPNet a
 pre-training method, named masked and permuted language modeling, to inherit the advantages of
 masked language modeling and permuted language modeling for natural language understanding.

+Note that this annotator is only supported for Spark Versions 3.4 and up.
+
 Pretrained models can be loaded with `pretrained` of the companion object:

 ```scala

docs/en/transformers.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ modify_date: "2023-06-18"
 use_language_switcher: "Python-Scala-Java"
 show_nav: true
 sidebar:
-nav: sparknlp
+nav: sparknlp
 ---

 <script> {% include scripts/transformerUseCaseSwitcher.js %} </script>

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -314,7 +314,7 @@ data:

 - title:
   image:
-    src: https://upload.wikimedia.org/wikipedia/fr/thumb/8/8e/Centre_national_de_la_recherche_scientifique.svg/2048px-Centre_national_de_la_recherche_scientifique.svg.png
+    src: https://iscpif.fr/wp-content/uploads/2023/11/Logo-CNRS-ISCPIF.png
   url: https://iscpif.fr/
   style: "padding: 30px;"
   is_row: true
@@ -344,7 +344,7 @@ data:
   is_row: true
 - title:
   image:
-    src: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Columbia_University_shield.svg/1184px-Columbia_University_shield.svg.png
+    src: https://miro.medium.com/v2/resize:fit:1024/0*3qIWoFnZgVUtsXB-.png
   url: https://www.columbia.edu/
   style: "padding: 25px;"
   is_row: true

0 commit comments
