{%- capture title -%}
BGEEmbeddings
{%- endcapture -%}

{%- capture description -%}
Sentence embeddings using BGE.

BGE, or BAAI General Embeddings, is a model that can map any text to a low-dimensional dense
vector, which can be used for tasks like retrieval, classification, clustering, or semantic
search.

Note that this annotator is only supported for Spark versions 3.4 and up.

Pretrained models can be loaded with the `pretrained` method of the companion object:

```scala
val embeddings = BGEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
```

If no name is provided, the default model `"bge_base"` is used.
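
A specific model name and language can also be passed to `pretrained` explicitly. As a minimal sketch (requires a running Spark NLP session; the English `"bge_base"` model is used here as an example):

```scala
// Load a named pretrained model for a given language instead of the default
val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document")
  .setOutputCol("embeddings")
```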

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?q=BGE).

For extended examples of usage, see
[BGEEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddingsTestSpec.scala).

**Sources**:

[C-Pack: Packaged Resources To Advance General Chinese Embedding](https://arxiv.org/pdf/2309.07597)

[BGE Github Repository](https://github.com/FlagOpen/FlagEmbedding)

**Paper abstract**

*We introduce C-Pack, a package of resources that significantly advance the field of general
Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive
benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive
text embedding dataset curated from labeled and unlabeled Chinese corpora for training
embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models
outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the
release. We also integrate and optimize the entire suite of training methods for C-TEM. Along
with our resources on general Chinese embedding, we release our data and models for English
text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark;
meanwhile, our released English data is 2 times larger than the Chinese data. All these
resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.*
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("bge_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["bge_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([
    ["query: how much protein should a female eat"],
    ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " + \
     "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " + \
     "marathon. Check out the chart below to see how much protein you should be eating each day."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document")
  .setOutputCol("bge_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("bge_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  embeddingsFinisher
))

val data = Seq(
  "query: how much protein should a female eat",
  "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
    "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
    "marathon. Check out the chart below to see how much protein you should be eating each day."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(2, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[BGEEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[BGEEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/bge_embeddings/index.html#sparknlp.annotator.embeddings.bge_embeddings.BGEEmbeddings)
{%- endcapture -%}

{%- capture source_link -%}
[BGEEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/BGEEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}