4. Command Reference

The Profiler performance debugging tool is invoked as follows:

tcim-profile [OPTION1 [ARG1]] [OPTION2 [ARG2]] [OPTION3 [ARG3]] ...

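The options are described below:

-f, --json_path: Path to the file used for performance data parsing. When this option is given, no other option may be used with it.
-m, --model_path: Path to the model file to be profiled.
-w, --weight_path: Weight file required for model inference, in .npy format.
-t, --target: Houmo device used for model inference. Default: xh1. Set it to xh2 to run inference on a Houmo M50 device.
--output_name: Name of the profiled model. Default: profile_model.
--output_dir: Directory that holds the model compilation results together with the profiling data and results. Default: output/xh1 under the current path.
--work_dir: Directory that holds intermediate model compilation results. Default: ./output/workspace.
--batch: Batch size used for model compilation and inference.
--ncore: Number of IPU cores used for model inference. Default: 1. Supported values: 1, 2, 4.
--opt_level: Model compilation optimization level. Default: O2. Supported values: O0, O1, O2.
--device-id: Logical ID of the Houmo device used for model inference. Default: 0. The logical ID can be obtained with the Houmo SMI tool; see the "SMI Tool User Guide" for details.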
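The documented defaults and supported values can be checked before the tool is launched. The following is a minimal sketch, not part of the Profiler itself: build_profile_command is a hypothetical helper that only assembles an argument list from the options listed above and rejects --ncore and --opt_level values outside the documented sets.

```python
from typing import List

# Values taken from the option descriptions above.
SUPPORTED_NCORE = (1, 2, 4)
SUPPORTED_OPT_LEVELS = ("O0", "O1", "O2")


def build_profile_command(
    model_path: str,
    target: str = "xh1",
    output_name: str = "profile_model",
    ncore: int = 1,
    opt_level: str = "O2",
    device_id: int = 0,
) -> List[str]:
    """Hypothetical helper: assemble a tcim-profile argument list.

    The defaults mirror the documented defaults. Nothing is executed here.
    """
    if ncore not in SUPPORTED_NCORE:
        raise ValueError(f"--ncore must be one of {SUPPORTED_NCORE}, got {ncore}")
    if opt_level not in SUPPORTED_OPT_LEVELS:
        raise ValueError(
            f"--opt_level must be one of {SUPPORTED_OPT_LEVELS}, got {opt_level}"
        )
    return [
        "tcim-profile",
        "-m", model_path,
        "-t", target,
        "--output_name", output_name,
        "--ncore", str(ncore),
        "--opt_level", opt_level,
        "--device-id", str(device_id),
    ]


# Example: profile on an M50 device (xh2) with two IPU cores.
cmd = build_profile_command("model.onnx", target="xh2", ncore=2)
print(" ".join(cmd))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting issues.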

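5. Profiling Methods

The Profiler performance debugging tool supports two profiling methods for different scenarios.

5.1. Method 1: Compile, Run Inference, and Parse Performance Data with a Single Command

This method uses the tcim-profile command to carry out the whole flow: model compilation, inference, and performance data parsing. Because model compilation is supported only on Linux hosts, this method runs only in a Linux environment. Compilation and inference parameters are specified on the command line.

Enter the docker image provided by the software platform and run the tcim-profile command, as in the example below:

Note

This method runs only on Houmo hardware devices and cannot be executed in the simulator. To make sure it runs on the hardware platform, set the environment variable with export HDPL_PLATFORM=ASIC.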

tcim-profile -m <quantized_onnx_model> --output_name <output_model_name> -t xh2

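Here, quantized_onnx_model should be replaced with the quantized model, and output_model_name with the model name.

The data is printed to the terminal as tables; see Performance Data Analysis for details.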
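When Method 1 is driven from a script instead of an interactive shell, the HDPL_PLATFORM setting has to reach the child process. The sketch below shows one way to do that with the standard subprocess module. The function names profile_env, method_one_command, and run_method_one are hypothetical, and the output name yolov5s is only an illustration.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Optional


def profile_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and force execution on hardware (HDPL_PLATFORM=ASIC)."""
    env = dict(os.environ if base_env is None else base_env)
    env["HDPL_PLATFORM"] = "ASIC"
    return env


def method_one_command(model: str, output_name: str, target: str = "xh2") -> List[str]:
    """Argument list equivalent to the Method 1 command line shown above."""
    return ["tcim-profile", "-m", model, "--output_name", output_name, "-t", target]


def run_method_one(model: str, output_name: str) -> Optional[subprocess.CompletedProcess]:
    """Run tcim-profile if it is installed; otherwise only report the command."""
    cmd = method_one_command(model, output_name)
    if shutil.which("tcim-profile") is None:
        print("tcim-profile not found on PATH; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, env=profile_env(), check=True)


# Show what would be executed without launching anything.
cmd = method_one_command("hmquant_houmo_tcim_yolov5s.onnx", "yolov5s")
env = profile_env({"PATH": "/usr/bin"})
print(" ".join(cmd))
print("HDPL_PLATFORM =", env["HDPL_PLATFORM"])
```

Passing env to subprocess.run keeps the setting local to the profiling run, so the parent shell's environment is left unchanged.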

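5.2. Method 2: Generate Performance Data Step by Step, Then Parse It

This method generates the required performance data and related information during model compilation and inference, and then uses the Profiler tool to parse it. It suits cross-platform performance comparison, since inference performance on different platforms can be measured and compared.

Call the build_from_hmonnx interface to compile the model, with the enable_profile parameter set to True. For example: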

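import onnx
import tcim

model_path = "hmquant_houmo_tcim_yolov5s.onnx"
onnx_model = onnx.load(model_path)  # not used below: build_from_hmonnx is given the path
# enable_profile=True turns on generation of the profiling information
tcim.build_from_hmonnx(model_path, output_dir="./output", work_dir="./output/workspace", target="xh2", opt_level="O2", enable_profile=True)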

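During compilation, the model information file profile_spec.json is generated and saved by default in output/workspace/profile under the current path.

Run model inference on the target platform. In the example below, module.get_output(output_name) retrieves each output; the profile data is converted to int8 before being written to a bin file.

The main part of the inference code is shown below: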

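import os

import cv2
import numpy as np
import torch
import tcim_lite

# 1. load compiled model
module = tcim_lite.runtime.load("output/model.hmm")

# 2. preprocess
# YoloV5 is a pre-processing helper class defined elsewhere in the example project (not shown here).
yolov5 = YoloV5()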
img_path = "../images/cat.jpg"
cv_image = cv2.imread(img_path)
input_data = yolov5.preprocess(cv_image)
input_data = torch.tensor(input_data, dtype=torch.float32)
input_data = torch.squeeze(input_data, 0)
input_data = input_data.permute(2, 0, 1)
input_data = torch.unsqueeze(input_data, 0)  # NCHW float32
input_data = input_data.numpy().astype(np.float32)

# 3. set input
input_num = module.get_num_inputs()
for idx in range(input_num):
    input_name = module.get_input_name(idx)
    # permute() may leave a non-contiguous layout, so pass a C-contiguous array
    module.set_input(input_name, np.ascontiguousarray(input_data))

# 4. infer model
module.run()
module.sync()

# 5. get output
profile_dir = "./output/workspace/profile"  # same directory as profile_spec.json
os.makedirs(profile_dir, exist_ok=True)
profile_outputs = ("auto_profile_data.bin", "primitive_profile_data.bin")
found = {name: False for name in profile_outputs}

output_num = module.get_num_outputs()
for idx in range(output_num):
    output_name = module.get_output_name(idx)
    if output_name in found:
        # Profile data: copy to host, convert to int8, and dump as a raw bin file.
        save_path = os.path.join(profile_dir, output_name)
        profile_data = module.get_output(output_name).to_host().astype(np.int8).numpy()
        profile_data.tofile(save_path)
        found[output_name] = True
        print(f"Saved {output_name} data to {save_path}")
    else:
        # Regular model output, converted to float32 for post-processing.
        output_data = module.get_output(output_name).astype(np.float32).numpy()

for name, was_found in found.items():
    if not was_found:
        print(f"Warning: '{name}' not found in model outputs.")

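After inference, auto_profile_data.bin and primitive_profile_data.bin appear among the results as output names (output_name).

auto_profile_data.bin is generated only if the model uses macro-instruction operators during inference on the IPU.
primitive_profile_data.bin is generated only if the model uses RVV operators during inference on the IPU.

In the example, the generated bin files are saved in output/workspace/profile under the current path, the same location as profile_spec.json.

Tip

For the Profiler performance debugging tool to work properly, keep auto_profile_data.bin and primitive_profile_data.bin in the same directory as profile_spec.json. If they are stored elsewhere, set the profile_data_file field in profile_spec.json to the corresponding file path.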
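If the bin files do end up in a different directory, the profile_data_file entries can be rewritten with a few lines of standard-library Python. The sketch below is only an illustration. It assumes that profile_data_file sits one level down inside objects such as auto_profile_spec, as in the sample profile_spec.json in this document. The helper name set_profile_data_dir and the directory /data/profile_bins are hypothetical.

```python
import json
import os
from typing import Any, Dict


def set_profile_data_dir(spec: Dict[str, Any], data_dir: str) -> int:
    """Point every nested "profile_data_file" entry at data_dir.

    Only the file name of the old value is kept; its directory part is
    replaced by data_dir. Returns the number of entries updated.
    """
    updated = 0
    for value in spec.values():
        if isinstance(value, dict) and "profile_data_file" in value:
            file_name = os.path.basename(value["profile_data_file"])
            value["profile_data_file"] = os.path.join(data_dir, file_name)
            updated += 1
    return updated


# In-memory stand-in for a loaded profile_spec.json (assumed layout).
spec = {
    "auto_profile_spec": {"profile_data_file": "auto_profile_data.bin"},
    "target": "xh2",
}
count = set_profile_data_dir(spec, "/data/profile_bins")  # illustrative directory
print(count)
print(json.dumps(spec, indent=2))
```

For a real file, load it with json.load, apply the helper, and write it back with json.dump. Keeping a copy of the original profile_spec.json is advisable before overwriting it.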

Call the Profiler performance debugging tool to analyze the performance data, specifying profile_spec.json. For example:
```
cd output/workspace/profile
tcim-profile -f profile_spec.json -t xh2
```
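The data is printed to the terminal as tables; see Performance Data Analysis for details.

The examples above show only the key steps. They are for reference and cannot be copied and run as they are.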

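5.2.1. Notes

After compilation, profile_spec.json and the related output files are generated automatically, and the corresponding paths are written into profile_spec.json. By default the tool uses the paths recorded in profile_spec.json to locate these files, so there is normally no need to move them. If the output files are moved, the corresponding fields in profile_spec.json must be updated as well so that the tool can still find them. The relevant fields are: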

profile_data_file
output_dir
profile_spec_path
device_check_out
device_check
cpp_file

A sample profile_spec.json file is shown below:

{
  "auto_profile_spec": {
    "head_tag": "",
    "head_fmt": "QQQQQQQQ",
    "record_fmt": "QQ",
    "tail_fmt": "",
    "tail_tag": "",
    "profile_data_size": 286208,
    "profile_data_file": "auto_profile_data.bin"
  },
  "core_num": 1,
  "round_num": 1,
  "build_type": "public",
  "target": "xh2",
  "output_dir": "/usr/local/src/yolov5s/output/xh2/workspace/profile",
  "profile_spec_path": "/usr/local/src/yolov5s/output/xh2/workspace/profile/profile_spec.json",
  "device_check_out": "/usr/local/src/yolov5s/output/xh2/workspace/output.device.out",
  "device_check": "/usr/local/src/yolov5s/output/xh2/workspace/yolo.device",
  "intrinsic_mode": true,
  "analyze_ddr_bandwidth_usage": true,
  "profile_primitive_operator": "primitive_hmir_kv_cache",
  "cpp_file": "/usr/local/src/yolov5s/output/xh2/workspace/output.cpp",
  "hmcc_op_perf": "/usr/local/src/yolov5s/output/xh2/workspace/profile/hmcc_op_perf.json",
  "output_mlir": "/usr/local/src/yolov5s/"
}
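Before running the parsing step, it can help to confirm that the paths recorded in profile_spec.json still exist, especially after files have been moved. The following sketch checks the top-level path fields listed in this section. The helper find_missing_paths is hypothetical, and the nested profile_data_file entry is left out because it may be stored as a bare file name.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

# Top-level path fields named in this section.
PATH_FIELDS = (
    "output_dir",
    "profile_spec_path",
    "device_check_out",
    "device_check",
    "cpp_file",
)


def find_missing_paths(spec: Dict[str, Any]) -> List[str]:
    """Return the names of path fields whose target does not exist on disk."""
    missing = []
    for field in PATH_FIELDS:
        path = spec.get(field)
        if path is not None and not os.path.exists(path):
            missing.append(field)
    return missing


# Self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    spec_path = os.path.join(tmp, "profile_spec.json")
    demo_spec = {
        "output_dir": tmp,
        "profile_spec_path": spec_path,
        "cpp_file": os.path.join(tmp, "output.cpp"),  # deliberately not created
    }
    with open(spec_path, "w", encoding="utf-8") as fh:
        json.dump(demo_spec, fh)
    with open(spec_path, encoding="utf-8") as fh:
        loaded = json.load(fh)
    missing = find_missing_paths(loaded)

print(missing)
```

Any field reported as missing is one that needs updating in profile_spec.json before the Profiler can locate the corresponding file.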

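6. Performance Data Analysis

The Profiler performance debugging tool presents IPU core instruction performance data and DDR bandwidth statistics as tables, so that users can analyze and optimize in detail.

6.1. IPU Core Instruction Performance Analysis

6.1.1. Table Output

The table shows, for each tile on each IPU core, the total cycle count (cycles) and execution time (us) of each instruction type, such as vector, store, and dma_channel, during inference. Data is reported per tile and covers both RISC-V instructions and IPU core instructions.

The following example shows the performance data for tile0 of an IPU core: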

==============================================
Summary of tile0
dma_channel_0: 376466 cycles 268.904 us
dma_channel_1: 92748 cycles 66.249 us
dma_channel_2: 72464 cycles 51.760 us
dma_channel_3: 50936 cycles 36.383 us
idle: 295935 cycles 211.382 us
load1: 535876 cycles 382.769 us
nl: 206011 cycles 147.151 us
store: 276 cycles 0.197 us
tensor_feature: 956256 cycles 683.040 us
vector512_vp0: 325960 cycles 232.829 us
==============================================
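For comparisons across runs, the tile summary can be turned into a dictionary. The sketch below parses the sample output shown above and reports the busiest non-idle unit. The line format is inferred from this single sample, so the regular expression is an assumption rather than a documented format, and parse_tile_summary is a hypothetical helper.

```python
import re
from typing import Dict, Tuple

SAMPLE = """\
==============================================
Summary of tile0
dma_channel_0: 376466 cycles 268.904 us
dma_channel_1: 92748 cycles 66.249 us
dma_channel_2: 72464 cycles 51.760 us
dma_channel_3: 50936 cycles 36.383 us
idle: 295935 cycles 211.382 us
load1: 535876 cycles 382.769 us
nl: 206011 cycles 147.151 us
store: 276 cycles 0.197 us
tensor_feature: 956256 cycles 683.040 us
vector512_vp0: 325960 cycles 232.829 us
==============================================
"""

# "<unit>: <cycles> cycles <time> us" (format assumed from the sample).
LINE_RE = re.compile(r"^(\w+):\s+(\d+)\s+cycles\s+([\d.]+)\s+us$")


def parse_tile_summary(text: str) -> Dict[str, Tuple[int, float]]:
    """Map each unit name to (cycles, microseconds)."""
    stats: Dict[str, Tuple[int, float]] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stats[match.group(1)] = (int(match.group(2)), float(match.group(3)))
    return stats


stats = parse_tile_summary(SAMPLE)

# Busiest unit by cycle count, ignoring the idle entry.
busy = {name: value for name, value in stats.items() if name != "idle"}
busiest = max(busy, key=lambda name: busy[name][0])

# Ratio of cycles to microseconds across all rows of the sample.
total_cycles = sum(cycles for cycles, _ in stats.values())
total_us = sum(us for _, us in stats.values())
cycles_per_us = total_cycles / total_us

print(busiest, round(cycles_per_us))
```

In this sample the busiest unit is tensor_feature, and the cycle and time columns are related by roughly 1400 cycles per microsecond. That ratio is read off the sample data only and may differ on other devices or clock settings.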

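6.2. DDR Bandwidth Performance Analysis

6.2.1. Table Output

The table shows the total DDR traffic (Total ddr usage), the average bandwidth (Avg bandwidth), and the peak bandwidth (Max bandwidth) during inference.

The following example shows the DDR bandwidth statistics: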

===========DDR Usage Summary===========
Total ddr usage is  29.524 MB (in  0.001838s)
Avg bandwidth is  16062.779 MB/s
Max bandwidth is  82193.925 MB/s
=======================================
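The three figures are related: the average bandwidth is the total traffic divided by the elapsed time. The sketch below parses the sample summary and recomputes that average as a sanity check. The regular expressions are written against this one sample and are therefore an assumption about the output format, and parse_ddr_summary is a hypothetical helper.

```python
import re
from typing import Dict

SAMPLE = """\
===========DDR Usage Summary===========
Total ddr usage is  29.524 MB (in  0.001838s)
Avg bandwidth is  16062.779 MB/s
Max bandwidth is  82193.925 MB/s
=======================================
"""


def parse_ddr_summary(text: str) -> Dict[str, float]:
    """Extract total traffic, elapsed time, average and peak bandwidth."""
    total = re.search(
        r"Total ddr usage is\s+([\d.]+)\s*MB\s*\(in\s+([\d.]+)\s*s\)", text
    )
    avg = re.search(r"Avg bandwidth is\s+([\d.]+)\s*MB/s", text)
    peak = re.search(r"Max bandwidth is\s+([\d.]+)\s*MB/s", text)
    if not (total and avg and peak):
        raise ValueError("unrecognized DDR usage summary")
    return {
        "total_mb": float(total.group(1)),
        "seconds": float(total.group(2)),
        "avg_mb_s": float(avg.group(1)),
        "max_mb_s": float(peak.group(1)),
    }


ddr = parse_ddr_summary(SAMPLE)

# Average bandwidth = total traffic / elapsed time.
recomputed_avg = ddr["total_mb"] / ddr["seconds"]
# How far the peak rises above the average.
peak_to_avg = ddr["max_mb_s"] / ddr["avg_mb_s"]

print(f"recomputed average: {recomputed_avg:.1f} MB/s")
print(f"peak / average: {peak_to_avg:.2f}")
```

For the sample, the recomputed average is about 16063 MB/s, which agrees with the reported 16062.779 MB/s to within the rounding of the printed values. The peak is roughly five times the average, which suggests bursty DDR access rather than a steady stream.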