Commit dfa4883

[docs] quant_kwargs (#11712)
* draft
* update
1 parent 94df8ef commit dfa4883

File tree

3 files changed: +20 additions, -14 deletions


docs/source/en/_toctree.yml

Lines changed: 1 addition & 1 deletion
@@ -179,7 +179,7 @@
   isExpanded: false
   sections:
     - local: quantization/overview
-      title: Getting Started
+      title: Getting started
     - local: quantization/bitsandbytes
       title: bitsandbytes
     - local: quantization/gguf

docs/source/en/api/quantization.md

Lines changed: 4 additions & 4 deletions
@@ -27,19 +27,19 @@ Learn how to quantize models in the [Quantization](../quantization/overview) gui
 
 ## BitsAndBytesConfig
 
-[[autodoc]] BitsAndBytesConfig
+[[autodoc]] quantizers.quantization_config.BitsAndBytesConfig
 
 ## GGUFQuantizationConfig
 
-[[autodoc]] GGUFQuantizationConfig
+[[autodoc]] quantizers.quantization_config.GGUFQuantizationConfig
 
 ## QuantoConfig
 
-[[autodoc]] QuantoConfig
+[[autodoc]] quantizers.quantization_config.QuantoConfig
 
 ## TorchAoConfig
 
-[[autodoc]] TorchAoConfig
+[[autodoc]] quantizers.quantization_config.TorchAoConfig
 
 ## DiffusersQuantizer

docs/source/en/quantization/overview.md

Lines changed: 15 additions & 9 deletions
@@ -11,27 +11,33 @@ specific language governing permissions and limitations under the License.
 
 -->
 
-# Quantization
+# Getting started
 
 Quantization focuses on representing data with fewer bits while also trying to preserve the precision of the original data. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits.
 
 Diffusers supports multiple quantization backends to make large diffusion models like [Flux](../api/pipelines/flux) more accessible. This guide shows how to use the [`~quantizers.PipelineQuantizationConfig`] class to quantize a pipeline during its initialization from a pretrained or non-quantized checkpoint.
 
 ## Pipeline-level quantization
 
-There are two ways you can use [`~quantizers.PipelineQuantizationConfig`] depending on the level of control you want over the quantization specifications of each model in the pipeline.
+There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending on how much customization you want to apply to the quantization configuration.
 
-- for more basic and simple use cases, you only need to define the `quant_backend`, `quant_kwargs`, and `components_to_quantize`
-- for more granular quantization control, provide a `quant_mapping` that provides the quantization specifications for the individual model components
+- for basic use cases, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments
+- for granular quantization control, define a `quant_mapping` that provides the quantization configuration for individual model components
 
-### Simple quantization
+### Basic quantization
 
 Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters.
 
 - `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
-- `quant_kwargs` contains the specific quantization arguments to use.
+- `quant_kwargs` specifies the quantization arguments to use.
+
+> [!TIP]
+> These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend.
+
 - `components_to_quantize` specifies which components of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
 
+The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`.
+
 ```py
 import torch
 from diffusers import DiffusionPipeline
@@ -56,13 +62,13 @@ pipe = DiffusionPipeline.from_pretrained(
 image = pipe("photo of a cute dog").images[0]
 ```
 
-### quant_mapping
+### Advanced quantization
 
-The `quant_mapping` argument provides more flexible options for how to quantize each individual component in a pipeline, like combining different quantization backends.
+The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends.
 
 Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder.
 
-The example below uses two quantization backends, [`~quantizers.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder.
+The example below uses two quantization backends, [`~quantizers.quantization_config.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder.
 
 ```py
 import torch
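
Both code examples in this file are cut off by the hunk boundaries above. For reference, here is a minimal sketch of the basic pipeline-level quantization flow the updated guide describes; the checkpoint id, the exact `quant_kwargs` values, and the `from diffusers.quantizers import PipelineQuantizationConfig` import path are assumptions filling in the truncated parts of the example.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig  # assumed import path

# Quantize only the listed components; everything else in the pipeline loads as usual.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,                      # store weights in 4-bit
        "bnb_4bit_quant_type": "nf4",              # assumed quantization type
        "bnb_4bit_compute_dtype": torch.bfloat16,  # compute dtype for matmuls
    },
    components_to_quantize=["transformer", "text_encoder_2"],  # T5 encoder; CLIP left intact
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint id
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```

Only the components named in `components_to_quantize` are quantized with the chosen backend; the remaining models (CLIP text encoder, VAE) are loaded in `torch_dtype` as usual.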
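
Likewise, a sketch of the `quant_mapping` variant that the second truncated block begins. The per-component configs, dtypes, and model id are assumptions; `BitsAndBytesConfig` is imported from transformers because `text_encoder_2` is a transformers model, matching the guide's reference to [`transformers.BitsAndBytesConfig`].

```py
import torch
from diffusers import DiffusionPipeline, QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig  # assumed import path
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Mix backends per component: quanto for the transformer, bitsandbytes for the T5 text encoder.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": QuantoConfig(weights_dtype="int8"),  # assumed int8 weights
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    }
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint id
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
```

This variant requires both backends to be installed (optimum-quanto and bitsandbytes).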
