Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSlim into develop

lizexu123 · lizexu123 · commit c6341066792a · 2024-01-23T12:16:20.000Z
diff --git a/example/auto_compression/detection/README.md b/example/auto_compression/detection/README.md
@@ -78,16 +78,20 @@
 Install paddlepaddle:
 ```shell
 # CPU
-pip install paddlepaddle==2.4.1
-# GPU, using Ubuntu and CUDA 11.2 as an example
-python -m pip install paddlepaddle-gpu==2.4.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
+python -m pip install paddlepaddle==2.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
+# GPU, using Ubuntu and CUDA 11.6 as an example
+python -m pip install paddlepaddle-gpu==2.6.0.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
 ```
 
 Install paddleslim:
 ```shell
 pip install paddleslim
 ```
-
+Install from source (recommended):
+```shell
+git clone -b release/2.6 https://github.com/PaddlePaddle/PaddleSlim.git && cd PaddleSlim
+python setup.py install
+```
 Install paddledet:
 ```shell
 pip install paddledet
@@ -101,7 +105,7 @@ pip install paddledet
 
 If your dataset is not in COCO format, modify the Dataset field in the reader configuration files under [configs](./configs).
 
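-Taking the PP-YOLOE model as an example, if the dataset is already prepared, simply set the `dataset_dir` field of `EvalDataset` in [./configs/yolo_reader.yml] to your own dataset path.
+Taking the PP-YOLOE model as an example, if the dataset is already prepared, simply set the `dataset_dir` field of both `EvalDataset` and `TrainDataset` in [./configs/yolo_reader.yml] to your own dataset path.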
 
 #### 3.3 Prepare the inference model
 
@@ -113,10 +117,16 @@ pip install paddledet
 Export the inference model following the [PaddleDetection documentation](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/docs/tutorials/GETTING_STARTED_cn.md#8-%E6%A8%A1%E5%9E%8B%E5%AF%BC%E5%87%BA). The PP-YOLOE export below serves as an example:
 - Download the code
 ```
-git clone https://github.com/PaddlePaddle/PaddleDetection.git
+git clone -b release/2.6 https://github.com/PaddlePaddle/PaddleDetection.git
 ```
 - Export the inference model
-
+- If you use Paddle Inference without TensorRT, run the following command to export the model (without NMS):
+```shell
+python tools/export_model.py \
+        -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
+        -o weights=https://paddledet.bj.bcebos.com/models/ppyoloe_crn_s_300e_coco.pdparams \
+        exclude_post_process=True
+```
 PP-YOLOE-l model, with NMS: for a quick trial, you can directly download the [exported PP-YOLOE-l model](https://bj.bcebos.com/v1/paddle-slim-models/act/ppyoloe_crn_l_300e_coco.tar).
 ```shell
 python tools/export_model.py \
@@ -146,7 +156,7 @@ python tools/export_model.py \
 #### 3.4 Run auto compression and produce the model
 
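 The distillation-and-quantization auto-compression example is launched through the run.py script, which uses the ```paddleslim.auto_compression.AutoCompression``` interface to compress the model automatically. Set the model path, distillation, quantization, and training parameters in the config file; once configured, the model can be quantized and distilled. The commands are:
-
+Note: the `include_nms` attribute in ppyoloe_s_qat_dis.yaml defaults to False. If the model you exported includes NMS, change it to True.
 - Single-GPU training: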
 ```
 export CUDA_VISIBLE_DEVICES=0
@@ -155,11 +165,10 @@ python run.py --config_path=./configs/ppyoloe_l_qat_dis.yaml --save_dir='./outpu
 
 - Multi-GPU training:
 ```
-CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch --log_dir=log --gpus 0,1,2,3 run.py \
-          --config_path=./configs/ppyoloe_l_qat_dis.yaml --save_dir='./output/'
+export CUDA_VISIBLE_DEVICES=0,1,2,3
+python -m paddle.distributed.launch run.py --save_dir='./rtdetr_hgnetv2_l_6x_coco_quant' --config_path=./configs/rtdetr_hgnetv2_l_qat_dis.yaml
 ```
 
-
 ## 4.Inference Deployment
 
 #### 4.1 Verify performance with Paddle Inference
@@ -178,20 +187,45 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch --log_dir=log -
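 | use_mkldnn | Whether to enable the ```MKL-DNN``` acceleration library. Note that when ```use_mkldnn``` and ```use_gpu``` are both ```True```, ```enable_mkldnn``` is ignored and ```GPU``` inference is used  |
 | cpu_threads | Number of CPU threads used for CPU inference, default 10  |
 | precision | Inference precision, one of `fp32/fp16/int8`  |
+| include_nms | Whether the model includes NMS: set to True if it does, False if it does not  |
+| use_dynamic_shape | Whether to use dynamic shape: set to True to enable it, False otherwise  |
+| img_shape | Input image size. Defaults to 640, meaning images are resized to 640*640  |
+| trt_calib_mode | Set to True if the model was produced by TensorRT offline quantization calibration. |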
 
 
 - TensorRT inference:
 
 Environment setup: to use the TensorRT inference engine, install a Paddle build compiled with ```WITH_TRT=ON```. Download: [Python inference library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)
-
+Model with NMS:
 ```shell
 python paddle_inference_eval.py \
-      --model_path=models/ppyoloe_crn_l_300e_coco_quant \
-      --reader_config=configs/yoloe_reader.yml \
-      --use_trt=True \
-      --precision=int8
+    --model_path=ppyoloe_crn_s_300e_coco \
+    --reader_config=configs/yolo_reader.yml \
+    --use_trt=True \
+    --precision=fp16 \
+    --include_nms=True \
+    --benchmark=True
+```
+Model without NMS:
+```shell
+python paddle_inference_eval.py \
+    --model_path=ppyoloe_crn_l_300e_coco \
+    --reader_config=configs/yolo_reader.yml \
+    --use_trt=True \
+    --precision=fp16 \
+    --include_nms=False \
+    --benchmark=True
+```
+- Native GPU inference:
+```shell
+python paddle_inference_eval.py \
+    --model_path=ppyoloe_crn_s_300e_coco \
+    --reader_config=configs/yolo_reader.yml \
+    --device=GPU \
+    --precision=fp16 \
+    --include_nms=True \
+    --benchmark=True
 ```
-
 - MKLDNN inference:
 
 ```shell
@@ -206,13 +240,7 @@ python paddle_inference_eval.py \
 
 - If the model is PPYOLOE and does not include NMS, the C++ inference demo can be used for speed testing:
 
-  Enter the [cpp_infer](./cpp_infer_ppyoloe) folder, prepare the environment and compile by following the [C++ TensorRT Benchmark tutorial](./cpp_infer_ppyoloe/README.md), then start testing:
-  ```shell
-  # Compile
-  bash complie.sh
-  # Run
-  ./build/trt_run --model_file ppyoloe_s_quant/model.pdmodel --params_file ppyoloe_s_quant/model.pdiparams --run_mode=trt_int8
-  ```
+  Refer directly to https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/gpu/ppyoloe_crn_l
 
 ## 5.FAQ
 
diff --git a/example/auto_compression/detection/paddle_inference_eval.py b/example/auto_compression/detection/paddle_inference_eval.py
@@ -18,6 +18,7 @@
 import sys
 import cv2
 import numpy as np
+from tqdm import tqdm
 
 import paddle
 from paddle.inference import Config
@@ -82,9 +83,15 @@ def argsparser():
     parser.add_argument("--img_shape", type=int, default=640, help="input_size")
     parser.add_argument(
         '--include_nms',
-        type=bool,
-        default=True,
+        type=str,
+        default='True',
         help="Whether include nms or not.")
+    parser.add_argument(
+        "--trt_calib_mode",
+        type=bool,
+        default=False,
+        help="If the model is produced by TRT offline quantization "
+        "calibration, trt_calib_mode needs to be set to True.")
 
     return parser
 
@@ -208,8 +215,9 @@ def load_predictor(
         use_mkldnn=False,
         batch_size=1,
         device="CPU",
-        min_subgraph_size=3,
+        min_subgraph_size=4,
         use_dynamic_shape=False,
+        trt_calib_mode=False,
         trt_min_shape=1,
         trt_max_shape=1280,
         trt_opt_shape=640,
@@ -238,9 +246,11 @@ def load_predictor(
     config = Config(
         os.path.join(model_dir, "model.pdmodel"),
         os.path.join(model_dir, "model.pdiparams"))
+
+    config.enable_memory_optim()
     if device == "GPU":
         # initial GPU memory(M), device ID
-        config.enable_use_gpu(200, 0)
+        config.enable_use_gpu(1000, 0)
         # optimize graph and fuse op
         config.switch_ir_optim(True)
     else:
@@ -260,12 +270,12 @@ def load_predictor(
     }
     if precision in precision_map.keys() and use_trt:
         config.enable_tensorrt_engine(
-            workspace_size=(1 << 25) * batch_size,
+            workspace_size=(1 << 30) * batch_size,
             max_batch_size=batch_size,
             min_subgraph_size=min_subgraph_size,
             precision_mode=precision_map[precision],
             use_static=True,
-            use_calib_mode=False, )
+            use_calib_mode=False)
 
         if use_dynamic_shape:
             dynamic_shape_file = os.path.join(FLAGS.model_path,
@@ -297,6 +307,7 @@ def predict_image(predictor,
     img, scale_factor = image_preprocess(image_file, image_shape)
     inputs = {}
     inputs["image"] = img
+
     if FLAGS.include_nms:
         inputs['scale_factor'] = scale_factor
     input_names = predictor.get_input_names()
@@ -356,7 +367,8 @@ def eval(predictor, val_loader, metric, rerun_flag=False):
     boxes_tensor = predictor.get_output_handle(output_names[0])
     if FLAGS.include_nms:
         boxes_num = predictor.get_output_handle(output_names[1])
-    for batch_id, data in enumerate(val_loader):
+    for batch_id, data in tqdm(
+            enumerate(val_loader), total=len(val_loader), desc='Evaluating'):
         data_all = {k: np.array(v) for k, v in data.items()}
         for i, _ in enumerate(input_names):
             input_tensor = predictor.get_input_handle(input_names[i])
@@ -382,7 +394,6 @@ def eval(predictor, val_loader, metric, rerun_flag=False):
             res = {'bbox': np_boxes, 'bbox_num': np_boxes_num}
         metric.update(data_all, res)
         if batch_id % 100 == 0:
-            print("Eval iter:", batch_id)
             sys.stdout.flush()
     metric.accumulate()
     metric.log()
@@ -421,7 +432,6 @@ def main():
             repeats=repeats)
     else:
         reader_cfg = load_config(FLAGS.reader_config)
-
         dataset = reader_cfg["EvalDataset"]
         global val_loader
         val_loader = create("EvalReader")(
@@ -432,6 +442,7 @@ def main():
         anno_file = dataset.get_anno()
         metric = COCOMetric(
             anno_file=anno_file, clsid2catid=clsid2catid, IouType="bbox")
+
         eval(predictor, val_loader, metric, rerun_flag=rerun_flag)
 
     if rerun_flag:
@@ -444,8 +455,12 @@ def main():
     paddle.enable_static()
     parser = argsparser()
     FLAGS = parser.parse_args()
+    if FLAGS.include_nms == 'True':
+        FLAGS.include_nms = True
+    else:
+        FLAGS.include_nms = False
 
     # DataLoader need run on cpu
     paddle.set_device("cpu")
 
-    main()
+    main()
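
Review note on the `--include_nms` change above: with `type=bool`, argparse calls `bool()` on the raw string, so `--include_nms=False` evaluates to `True`. The commit works around this by parsing the flag as a string and comparing it with `'True'`, while the newly added `--trt_calib_mode` still uses `type=bool`. A minimal sketch of one common alternative follows; the `str2bool` helper and `build_parser` function are hypothetical names introduced for illustration and are not part of this commit.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(
        f"expected a boolean value, got {value!r}")


def build_parser():
    # Hypothetical parser holding only the two boolean flags discussed above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--include_nms", type=str2bool, default=True)
    parser.add_argument("--trt_calib_mode", type=str2bool, default=False)
    return parser


# The pitfall: any non-empty string is truthy, including "False".
assert bool("False") is True

args = build_parser().parse_args(
    ["--include_nms=False", "--trt_calib_mode=True"])
```

With a converter like this, both flags accept `True`/`False` on the command line and reject anything else with a clear error, so no post-processing of `FLAGS` is needed after `parse_args()`.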
diff --git a/example/auto_compression/detection/post_process.py b/example/auto_compression/detection/post_process.py
@@ -41,8 +41,7 @@ def hard_nms(box_scores, iou_threshold, top_k=-1, candidate_size=200):
         rest_boxes = boxes[indexes, :]
         iou = iou_of(
             rest_boxes,
-            np.expand_dims(
-                current_box, axis=0), )
+            np.expand_dims(current_box, axis=0), )
         indexes = indexes[iou <= iou_threshold]
 
     return box_scores[picked, :]
@@ -122,7 +121,7 @@ def _non_max_suppression(self, prediction, scale_factor):
                 picked_labels.extend([class_index] * box_probs.shape[0])
 
             if len(picked_box_probs) == 0:
-                out_boxes_list.append(np.empty((0, 4)))
+                out_boxes_list.append(np.empty((0, 6)))
 
             else:
                 picked_box_probs = np.concatenate(picked_box_probs)
@@ -135,9 +134,8 @@ def _non_max_suppression(self, prediction, scale_factor):
                 # clas score box
                 out_box = np.concatenate(
                     [
-                        np.expand_dims(
-                            np.array(picked_labels), axis=-1), np.expand_dims(
-                                picked_box_probs[:, 4], axis=-1),
+                        np.expand_dims(np.array(picked_labels), axis=-1),
+                        np.expand_dims(picked_box_probs[:, 4], axis=-1),
                         picked_box_probs[:, :4]
                     ],
                     axis=1)
@@ -152,6 +150,6 @@ def _non_max_suppression(self, prediction, scale_factor):
         return out_boxes_list, box_num_list
 
     def __call__(self, outs, scale_factor):
-        out_boxes_list, box_num_list = self._non_max_suppression(outs,
-                                                                 scale_factor)
+        out_boxes_list, box_num_list = self._non_max_suppression(
+            outs, scale_factor)
         return {'bbox': out_boxes_list, 'bbox_num': box_num_list}
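
Review note on the `np.empty((0, 4))` to `np.empty((0, 6))` change above: the non-empty branch builds rows of six columns (class, score, then four box coordinates), so the empty placeholder presumably needs six columns as well to stay stackable with real detections. The sketch below illustrates that reasoning under this assumption; `assemble_detections` is a hypothetical stand-in for the relevant part of `_non_max_suppression`, not the function from the commit.

```python
import numpy as np


def assemble_detections(picked_labels, picked_box_probs):
    """Build an (N, 6) array laid out as [class, score, x1, y1, x2, y2].

    picked_box_probs is assumed to have shape (N, 5): four box
    coordinates followed by the score.
    """
    if len(picked_labels) == 0:
        # Same column count as the non-empty case, zero rows.
        return np.empty((0, 6))
    return np.concatenate(
        [
            np.expand_dims(np.array(picked_labels), axis=-1),
            np.expand_dims(picked_box_probs[:, 4], axis=-1),
            picked_box_probs[:, :4],
        ],
        axis=1)


no_dets = assemble_detections([], np.empty((0, 5)))
some_dets = assemble_detections(
    [2], np.array([[10.0, 20.0, 30.0, 40.0, 0.9]]))

# Stacking along axis 0 only works because both arrays have 6 columns.
merged = np.concatenate([no_dets, some_dets], axis=0)
```

Stacking a `(0, 4)` placeholder with `(N, 6)` results along axis 0 would raise a `ValueError`, which is the kind of failure the six-column placeholder avoids for images with no detections.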
diff --git a/example/post_training_quantization/pytorch_yolo_series/README.md b/example/post_training_quantization/pytorch_yolo_series/README.md
@@ -122,7 +122,7 @@ python eval.py --config_path=./configs/yolov5s_ptq.yaml
 #### 3.6 Improve post-training quantization accuracy
 
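 ###### 3.6.1 Quantization analysis tool
-This section describes how to use the quantization analysis tool to improve post-training quantization accuracy. Post-training quantization needs only a small amount of data, is simple to use, and produces a quantized model quickly, but it often causes a large accuracy loss. PaddleSlim provides a quantization analysis tool that uses the ```paddleslim.quant.AnalysisPTQ``` interface to visualize the layers that are unsuitable for quantization; skipping those layers improves the accuracy of the quantized model. For details on ```paddleslim.quant.AnalysisPTQ```, see [AnalysisPTQ.md](../../../docs/zh_cn/tutorials/quant/AnalysisPTQ.md).
+This section describes how to use the quantization analysis tool to improve post-training quantization accuracy. Post-training quantization needs only a small amount of data, is simple to use, and produces a quantized model quickly, but it often causes a large accuracy loss. PaddleSlim provides a quantization analysis tool that uses the ```paddleslim.quant.AnalysisPTQ``` interface to visualize the layers that are unsuitable for quantization; skipping those layers improves the accuracy of the quantized model. For details on ```paddleslim.quant.AnalysisPTQ```, see [AnalysisPTQ.md](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/quant/post_training_quantization.md).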
 
 
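 Because post-training quantization performs poorly on YOLOv6, YOLOv6 is used as the example. The quantization analysis tool is used as follows: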
@@ -207,7 +207,70 @@ python fine_tune.py --config_path=./configs/yolov6s_fine_tune.yaml --simulate_ac
 
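 ## 4.Inference Deployment
 For inference deployment, refer to the [YOLO series auto-compression example](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/auto_compression/pytorch_yolo_series)
-
-
+Quantized models can be accelerated with TensorRT on GPU and with MKLDNN on CPU.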
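+| Parameter | Description |
+| --- | --- |
+| model_path | Path to the inference model directory, which must contain the two files model.pdmodel and params.pdiparams |
+| dataset_dir | Root directory of the COCO dataset |
+| image_file | Path of a single image, used when testing on one image only |
+| val_image_dir | Directory name of the validation images in the COCO dataset, default val2017 |
+| val_anno_path | Path to the COCO annotation file, a JSON file containing the validation-set labels, default annotations/instances_val2017.json |
+| benchmark | Whether to run the performance benchmark. If set to True, the program runs a performance test |
+| device | Device used for inference, one of CPU/GPU/XPU, default GPU |
+| use_trt | Whether to use TensorRT for inference |
+| use_mkldnn | Whether to use the MKL-DNN acceleration library (mainly for CPU). Note that when use_mkldnn and use_gpu are both True, enable_mkldnn is ignored and GPU inference is used |
+| use_dynamic_shape | Whether to use the dynamic shape (dynamic_shape) feature |
+| precision | Inference precision, one of fp32/fp16/int8 |
+| arch | Name of the model architecture in use, for example YOLOv5 |
+| img_shape | Image size of the model input |
+| batch_size | Batch size of the model input |
+| cpu_threads | Number of threads to use on CPU |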
+
+We start from a YOLOv6 ONNX model, which must first be converted to a Paddle model. For details, see [Migrating inference models with X2Paddle](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/model_convert/convert_with_x2paddle_cn.html#x2paddle)
+- Install X2Paddle
+Option 1: install with pip
+```shell
+pip install X2Paddle==1.3.9
+```
+Option 2: install from source
+```shell
+git clone https://github.com/PaddlePaddle/X2Paddle.git
+cd X2Paddle
+python setup.py install
+```
+Convert the YOLOv6 ONNX model to a Paddle model with the following command:
+```shell
+x2paddle --framework=onnx --model=yolov6s.onnx --save_dir=yolov6_model
+```
+- TensorRT Python deployment
+Deploy with [paddle_inference_eval.py](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/example/auto_compression/pytorch_yolo_series/paddle_inference_eval.py):
+```shell
+python paddle_inference_eval.py --model_path=yolov6_model/inference_model --dataset_dir=datasets/coco --use_trt=True --precision=fp32 --arch=YOLOv6
+```
+Run with INT8 quantization:
+```shell
+python paddle_inference_eval.py --model_path=yolov6s_ptq_out --dataset_dir=datasets/coco --use_trt=True --precision=int8 --arch=YOLOv6
+```
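+- C++ deployment
+For details, see the [PP-YOLOE-l object detection sample](https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/gpu/ppyoloe_crn_l)
+In compile.sh, change DEMO_NAME to yolov6_test, rename ppyoloe_crn_l.cc to yolov6_test.cc, and adjust the library configuration for your environment.
+Run bash compile.sh to build the sample.
+- Run the sample
+- Run the sample on native GPU (copy the Paddle model converted from ONNX into the Paddle-Inference-Demo/c++/gpu/ppyoloe_crn_l/ directory)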
+```shell
+./build/yolov6_test --model_file yolov6s_infer/model.pdmodel --params_file yolov6s_infer/model.pdiparams
+```
+- Run the sample with TensorRT FP32
+```shell
+./build/yolov6_test --model_file yolov6s_infer/model.pdmodel --params_file yolov6s_infer/model.pdiparams --run_mode=trt_fp32
+```
+- Run the sample with TensorRT FP16
+```shell
+./build/yolov6_test --model_file yolov6s_infer/model.pdmodel --params_file yolov6s_infer/model.pdiparams --run_mode=trt_fp16
+```
+- Run the sample with TensorRT INT8
+```shell
+./build/yolov6_test --model_file yolov6s_infer/model.pdmodel --params_file yolov6s_infer/model.pdiparams --run_mode=trt_int8
+```
 ## 5.FAQ
 - To auto-compress the model, try the [YOLO series auto-compression example](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/auto_compression/pytorch_yolo_series).
diff --git a/paddleslim/quant/advanced/auto_clip.py b/paddleslim/quant/advanced/auto_clip.py
diff --git a/paddleslim/quant/advanced/piecewise_search.py b/paddleslim/quant/advanced/piecewise_search.py