Commit cd5a39b

[SPARKNLP-1189] Introducing Phi4 (#14606)
* Add Phi-4 model implementation and tokenizer support
  - Introduced `Phi4` class for the state-of-the-art Phi-4 model by Microsoft Research, including methods for encoding, decoding, and tagging.
  - Added `Phi4Transformer` for integration with Spark NLP, enabling pretrained model loading and configuration.
  - Implemented `Phi4Tokenizer` for byte pair encoding specific to the Phi-4 model.
  - Updated `BpeTokenizer` to support the new Phi-4 tokenizer.
  - Added tests for `Phi4Transformer` to ensure functionality and performance.
  - Updated resource downloader to include `Phi4Transformer` for pretrained model access.
* Enhance documentation for Phi-4 model in `Phi4Transformer.scala`
  - Expanded the model description to include detailed information on parameters, intended use, benchmarks, safety, limitations, and usage instructions.
  - Improved formatting for better readability and clarity.
  - Added references for further information on the Phi-4 model.
* Add Phi-4 Transformer implementation and integration
  - Introduced `Phi4Transformer` class for the state-of-the-art Phi-4 model by Microsoft Research, enabling advanced reasoning and NLP tasks.
  - Added support for loading pretrained models and configuration parameters.
  - Implemented a loader for the Phi-4 model in the internal module.
  - Created unit tests to validate the functionality of the `Phi4Transformer`.
  - Updated the main `__init__.py` to include the new transformer in the module exports.
* Add documentation and example notebook for Phi-4 Transformer
  - Created a detailed markdown file for the `Phi4Transformer`, outlining its features, usage, and pretrained model loading instructions.
  - Added a Jupyter notebook demonstrating the integration of the Phi-4 model with Spark NLP and Intel OpenVINO, including installation steps and example code.
  - Enhanced the Scala implementation comments for better clarity and formatting.
  - Updated the test specifications to reflect changes in the handling of temperature settings during predictions.
1 parent ccf7606 commit cd5a39b

12 files changed, +3844 -1 lines changed

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
{%- capture title -%}
Phi4Transformer
{%- endcapture -%}

{%- capture description -%}
Text Generation using Microsoft Phi-4.

Phi4Transformer loads Microsoft's Phi-4 model for advanced reasoning, math, code generation, and general NLP tasks. Phi-4 is a 14B-parameter, dense decoder-only Transformer trained on 9.8T tokens with a 16K context length. It works best with prompts in chat format and focuses on English (multilingual data makes up about 8% of the training data).

**Key Features:**
- 14B parameters, dense decoder-only Transformer
- 16K context length
- Trained on 9.8T tokens (synthetic, public domain, academic, Q&A, code)
- Benchmarks: MMLU 84.8, HumanEval 82.6, GPQA 56.1, DROP 75.5, MATH 80.6
- Safety alignment via SFT and DPO, red-teamed by Microsoft AIRT
- Released under MIT License

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val phi4 = Phi4Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")
```
If no name is provided, the default model is `"phi-4"`.

For available pretrained models, please see the [Models Hub](https://huggingface.co/microsoft/phi-4).

To see which models are compatible and how to import them, see [Import Transformers into Spark NLP 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).
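
Because Phi-4 is best suited to chat-format prompts, it can help to build the prompt string programmatically instead of writing the special tokens by hand. The following is a minimal sketch, not part of the Spark NLP API: `build_phi4_prompt` is a hypothetical helper, and the token layout is an assumption copied from the example prompts on this page.

```python
from typing import Optional

# Special tokens as they appear in the example prompts on this page (assumed layout).
IM_START = "<|im_start|>"
IM_SEP = "<|im_sep|>"


def build_phi4_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Build a single-turn chat prompt (hypothetical helper, not a Spark NLP function)."""
    turns = []
    if system_message is not None:
        turns.append(f"{IM_START}system{IM_SEP}\n{system_message}\n")
    turns.append(f"{IM_START}user{IM_SEP}\n{user_message}\n")
    # Leave the assistant turn open so the model generates the reply.
    turns.append(f"{IM_START}assistant{IM_SEP}\n")
    return "".join(turns)


prompt = build_phi4_prompt("What is Phi-4?", system_message="You are a helpful assistant!")
```

The resulting string can be used as the value of the `text` column that the `DocumentAssembler` reads.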
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

phi4 = Phi4Transformer.pretrained("phi-4") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(60) \
    .setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler, phi4])

data = spark.createDataFrame([
    (1, "<|im_start|>system<|im_sep|>\nYou are a helpful assistant!\n<|im_start|>user<|im_sep|>\nWhat is Phi-4?\n<|im_start|>assistant<|im_sep|>\n")
]).toDF("id", "text")

result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate=False)
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val phi4 = Phi4Transformer.pretrained("phi-4")
  .setInputCols(Array("documents"))
  .setMaxOutputLength(60)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, phi4))

val data = Seq(
  (1, "<|im_start|>system<|im_sep|>\nYou are a helpful assistant!\n<|im_start|>user<|im_sep|>\nWhat is Phi-4?\n<|im_start|>assistant<|im_sep|>\n")
).toDF("id", "text")

val result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate = false)
{%- endcapture -%}

{%- capture api_link -%}
[Phi4Transformer](/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi4Transformer.html)
{%- endcapture -%}

{%- capture python_api_link -%}
[Phi4Transformer](/api/python/reference/autosummary/sparknlp/annotator/seq2seq/phi4_transformer/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[Phi4Transformer](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi4Transformer.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
