tcim.builder.api

build_from_hmonnx

build_from_hmonnx(onnx_model: Union[str, onnx.ModelProto],
                  output_name: str = "model",
                  output_dir: str = "./output",
                  work_dir: Union[str, None] = None,
                  target: str = "xh1",
                  ncore: int = 1,
                  opt_level: str = "O2",
                  **kwargs: Union[dict, None],)

Builds a TCIM model.

Parameters

onnx_model (str or onnx.ModelProto)-- The ONNX model object to be built.
output_name (str)-- The name of the binary model file (.hmm) generated after model compilation. Default value is "model".
output_dir (str)-- The directory where the generated binary model file is saved. Defaults to "./output".
work_dir (Optional[str])-- The directory for saving intermediate files generated during compilation. Defaults to ./output/workspace.
target (Optional[str])-- The Houmo device hardware used for model inference. Supported values are:
- xh1: Represents Houmo M30 or H30 devices.
- xh2: Represents Houmo M50 devices.
Default value is "xh1".
ncore (int)-- The number of IPU cores used for model inference. Supported values are 1, 2, and 4. Defaults to 1.
opt_level (str)-- The optimization level applied during compilation. Supported values are O0, O1, and O2. Defaults to O2. Higher optimization levels, such as O2, provide better runtime performance but result in longer compilation times. For faster compilation, typically used for testing, set this parameter to O0.
kwargs (Optional[dict])-- Additional configuration options. See "Other Parameters" section for details.

Other Parameters

weights (Optional[str, numpy.NDArray])-- The input weights for large models (size of onnx_model >= 2GB) where weights and the ONNX model are stored separately. Defaults to None.
batch (int)-- The batch size for model inference and dynamic image resizing. Defaults to 1.
- If used only for model compilation and inference, it specifies the batch size for model inference.
  
  If set to 1, the input shape and batch size of the ONNX model are used. If set to a value greater than 1, the final batch size is calculated as:
  
  final_batch_size = (value of this parameter) x (batch size of the original ONNX model)
  
  The first dimension of the original ONNX input shape is treated as the batch (N) dimension. If the first dimension is not the batch dimension, an error occurs.
- If the dynamic image resizing function is used:
  - If one_img_multi_roi is set to True, the batch size indicates the number of ROIs (Regions of Interest) to be resized from a single input image.
  - If one_img_multi_roi is set to False, each image is resized into a single ROI based on its individual ROI configuration. The batch size indicates both the number of input images and the number of ROIs.
Attention

This parameter applies to non-LLM models only. For LLM models, use modify_llm.batch to configure the batch size.
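The batch-size formula above can be sketched in a few lines of Python. The helper name final_batch_size is hypothetical and is not part of tcim; it only mirrors the arithmetic described for the compilation-and-inference case and assumes the first dimension of the ONNX input shape is the batch (N) dimension.

```python
def final_batch_size(batch: int, onnx_input_shape: tuple) -> int:
    """Hypothetical helper: batch size of the compiled model for a given `batch` value.

    Assumes onnx_input_shape[0] is the batch (N) dimension of the original model.
    """
    if batch < 1:
        raise ValueError("batch must be a positive integer")
    original_batch = onnx_input_shape[0]
    # With batch == 1 this returns the original batch size unchanged.
    return batch * original_batch


print(final_batch_size(1, (2, 3, 224, 224)))  # 2
print(final_batch_size(4, (2, 3, 224, 224)))  # 8
```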
one_img_multi_roi (bool)-- Specifies whether to enable dynamic resizing of a single input image into multiple ROIs. Defaults to False. When set to True, an input image is resized into multiple batches, and each ROI is resized independently. The number of ROIs is determined by the value specified in the batch parameter.
modify_llm (dict)-- Dictionary to customize the batch size, context length, and input sequence length of an LLM during compilation. In the original model, these parameters are fixed and cannot be changed during inference. Using this parameter, the compiled model can support user-defined values for these parameters. The dictionary includes the following keys:
- batch (int)-- Batch size for the decode model. Supported values are 1, 2, 4, and 8.
- context-length (int)-- The maximum context length for both prefill and decode models, in tokens. Supported values are 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768.
- fill-length (int)-- The maximum sequence length to handle in a single forward pass in the prefill phase, in tokens. Supported values are 128, 256, and 512.
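As an illustration, a modify_llm dictionary could be checked against the supported values listed above before it is passed to the builder. The helper check_modify_llm is hypothetical and not part of tcim. It validates only the keys that are present, because this page does not state whether all three keys are required.

```python
# Supported values copied from the key descriptions above.
SUPPORTED_MODIFY_LLM = {
    "batch": {1, 2, 4, 8},
    "context-length": {256, 512, 1024, 2048, 4096, 8192, 16384, 32768},
    "fill-length": {128, 256, 512},
}


def check_modify_llm(cfg: dict) -> dict:
    """Hypothetical helper: reject unknown keys and unsupported values."""
    for key, value in cfg.items():
        if key not in SUPPORTED_MODIFY_LLM:
            raise KeyError(f"unknown modify_llm key: {key!r}")
        if value not in SUPPORTED_MODIFY_LLM[key]:
            raise ValueError(f"unsupported value for {key!r}: {value!r}")
    return cfg


# Note the hyphens in the key names, as documented above.
modify_llm = check_modify_llm(
    {"batch": 2, "context-length": 4096, "fill-length": 256}
)
```

The resulting dictionary would then be passed as the modify_llm keyword argument of build_from_hmonnx.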
ndevice (int)-- The number of devices used for model inference. Supported values are 1 and 2. Defaults to 1.
io_layout (str)-- The layout used for all model inputs and outputs. Supported values are "stdv1" and "any". Defaults to "stdv1".
- "stdv1": All inputs and outputs can be used as inputs or outputs of other models on the device side. This improves the efficiency of the pipeline but reduces the performance of this model.
- "any": The inputs and outputs cannot be used as inputs or outputs of other models on the device side, but the performance of this model improves.
enable_profile (bool)-- Specifies whether to enable the Profiler tool for performance analysis. Defaults to False. When set to True, the Profiler tool records and analyzes performance data for instructions executed on the IPU cores of the Houmo device. For more information, see "Profiler Tool User Guide".
enable_dynamic_image_resize (bool)-- Specifies whether to enable dynamic resizing for a single input image. Defaults to False. When set to True, the generated binary model file includes a new input tensor, dyn_info. Before model inference, set the value of each element of the dyn_info tensor to specify the resizing configuration. This parameter is only supported when the batch size is 1.

The data type of this tensor is int32, and its shape is [10]. The elements of this tensor are defined as [cropY, cropX, crop_height, crop_width, resize_height, resize_width, pad_top, pad_left, pad_bottom, pad_right], where:
- cropY: Vertical coordinate (in pixels) of the top-left corner of the cropping region.
- cropX: Horizontal coordinate (in pixels) of the top-left corner of the cropping region.
- crop_height: The height (in pixels) of the cropping region.
- crop_width: The width (in pixels) of the cropping region.
- resize_height: Target height (in pixels) after resizing.
- resize_width: Target width (in pixels) after resizing.
- pad_top: Vertical padding (in pixels) applied to the top of the image.
- pad_left: Horizontal padding (in pixels) applied to the left side of the image.
- pad_bottom: Vertical padding (in pixels) applied to the bottom of the image.
- pad_right: Horizontal padding (in pixels) applied to the right side of the image.
Note: Make sure the following constraints are met:
- dstH = outH - pad_top - pad_bottom, where outH is the original input height.
- dstW = outW - pad_left - pad_right, where outW is the original input width.
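The element order of dyn_info is easy to get wrong, so a small wrapper with named arguments may help. The sketch below uses NumPy to build an int32 tensor of shape [10] in the order listed above. The helper make_dyn_info and the sample values are illustrative assumptions. The sketch does not check the constraints in the note, and it does not show how the tensor is passed to the runtime.

```python
import numpy as np


def make_dyn_info(crop_y, crop_x, crop_height, crop_width,
                  resize_height, resize_width,
                  pad_top=0, pad_left=0, pad_bottom=0, pad_right=0):
    """Hypothetical helper: pack resize settings into the documented element order."""
    values = [
        crop_y, crop_x, crop_height, crop_width,   # cropping region
        resize_height, resize_width,               # target size after resizing
        pad_top, pad_left, pad_bottom, pad_right,  # padding on each side
    ]
    return np.array(values, dtype=np.int32)


# Illustrative values only: crop a 640x480 region and resize it to 224x224.
dyn_info = make_dyn_info(crop_y=100, crop_x=50,
                         crop_height=480, crop_width=640,
                         resize_height=224, resize_width=224)
print(dyn_info.shape, dyn_info.dtype)  # (10,) int32
```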
llm_opt (bool)-- Enables performance optimization for LLM (Large Language Model) inference. Defaults to False. Set to True only when performing inference with LLM models.
j (int)-- The number of CPU cores used for model compilation. Defaults to using all available CPU cores.
enable_common_subgraph (bool)-- Enables outlining of repeated computation blocks as common subgraphs during compilation. Defaults to False.

When set to True, this parameter can significantly reduce compilation time, particularly for Transformer-based models, making it suitable for quick verification builds. However, enabling it may slightly reduce runtime performance.
subgraph_repeat_hint (int)-- Specifies a reference value for the number of repeated blocks in the model to improve the quality of generated subgraphs when enable_common_subgraph is enabled. Defaults to 20. For best results, set this parameter to match the actual number of repeated blocks in the model. If the value is smaller than the actual number of repeated blocks, the blocks are not folded into subgraphs. Values exceeding the actual number are invalid.
enable_xh2_stable_output (bool)-- Ensures that inference outputs on the M50 products remain consistent across multiple runs of the same model. Defaults to False. Set to False for LLM models.
- True: Guarantees bit-accurate, consistent results across multiple runs of the same model. Note that this may cause a slight decrease in inference performance.
- False (Default): Provides maximum performance and faster execution, but outputs may exhibit minor random fluctuations due to the nature of parallel floating-point calculations.
flash_attention (int)-- Specifies whether to use the Flash Attention optimization strategy. This is an experimental feature supported exclusively on XH2 hardware.
- 0: Disable Flash Attention.
- 1: Enable Flash Attention at the computation graph level. This mode is currently not recommended for general use.
- 2: Enable Flash Attention internally within the attention operator. This mode is specifically designed for LLMs to enhance performance when the context length is greater than 2048 tokens.
Defaults to 0.
device_kernel_split (int)-- Specifies the number of sub-kernels into which a long-running kernel is split. This parameter is used when multiple models run concurrently on multiple devices. It prevents a "heavy" model from blocking the runtime scheduler. For example, if a large model has a kernel that runs for 100 ms and a small model has a kernel that runs for 10 ms, setting this value to 10 splits the large kernel into smaller segments.

This resolves Head-of-Line blocking, ensuring latency-sensitive tasks like ASR do not have to wait. This parameter is only effective for multi-device environments. Defaults to 1.
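The arithmetic behind the example can be sketched as follows. The helper longest_segment_ms is hypothetical and assumes the kernel is split into equal parts, which this page does not guarantee. It only illustrates how the longest wait seen by another model shrinks as the split count grows.

```python
def longest_segment_ms(kernel_ms: float, device_kernel_split: int = 1) -> float:
    """Hypothetical estimate: duration of each sub-kernel, assuming an even split."""
    if device_kernel_split < 1:
        raise ValueError("device_kernel_split must be at least 1")
    return kernel_ms / device_kernel_split


print(longest_segment_ms(100, 1))   # 100.0
print(longest_segment_ms(100, 10))  # 10.0
```

Under this assumption, a value of 10 brings each segment of the 100 ms kernel down to roughly the duration of the small model's 10 ms kernel.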
moe_device_sharding (Optional[str])-- Device sharding strategy used to distribute MoE expert weights across devices. Supported values are:
- "ep": (Default) Experts are partitioned across devices by expert count. Each device stores the full weights of a different subset of experts. For example, if four experts are distributed across two devices, each device stores two complete experts.
  
  This strategy provides the lowest memory footprint. However, decode execution may become load-imbalanced because different devices can activate different numbers of experts for each token.
- "tp": Every individual expert is sharded across all devices. Each device stores only 1 / ndevice of each expert's weights. For example, if four experts are distributed across two devices, each device stores all four experts, but only 1 / 2 of each expert's weights.
  
  Compared with the "ep" strategy, this strategy provides more balanced decode execution and typically improves decode performance, but prefill performance becomes worse.
- "er": Each device stores the full weights of all experts.
  
  This strategy eliminates expert-related load imbalance during decode and typically delivers the best decode performance. Prefill performance remains similar to the "ep" strategy, at the cost of increased memory usage.
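The three strategies differ only in which part of each expert a device stores. The sketch below models that as a per-device mapping from expert index to the stored fraction of that expert's weights. The function expert_layout is a hypothetical illustration, not part of tcim. For "ep" it assumes experts are assigned to devices in contiguous, equally sized groups, and the compiler's actual assignment may differ.

```python
def expert_layout(strategy: str, num_experts: int, ndevice: int) -> list:
    """Hypothetical model: per device, {expert index: fraction of its weights stored}."""
    experts = range(num_experts)
    if strategy == "ep":
        if num_experts % ndevice != 0:
            raise ValueError("this sketch assumes experts divide evenly across devices")
        per_device = num_experts // ndevice
        # Each device holds complete copies of its own subset of experts.
        return [
            {e: 1.0 for e in range(d * per_device, (d + 1) * per_device)}
            for d in range(ndevice)
        ]
    if strategy == "tp":
        # Every device holds a 1 / ndevice slice of every expert.
        return [{e: 1.0 / ndevice for e in experts} for _ in range(ndevice)]
    if strategy == "er":
        # Every device holds complete copies of all experts.
        return [{e: 1.0 for e in experts} for _ in range(ndevice)]
    raise ValueError(f"unknown strategy: {strategy!r}")


for name in ("ep", "tp", "er"):
    print(name, expert_layout(name, num_experts=4, ndevice=2))
```

With four experts and two devices, this reproduces the examples given above: two complete experts per device for "ep", half of each of the four experts for "tp", and all four complete experts for "er".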
enable_bundle_lora_param (bool)-- Controls if LoRA parameters are bundled into the compiled model inputs. When set to True, LoRA parameters are integrated into the model inputs during compilation, and LoRA input .npy files are generated in the output directory. Defaults to False.
custom_msg (str)-- Custom metadata to be embedded into the generated binary model file (.hmm) after model compilation. Defaults to an empty string. This information can later be retrieved using the Module::GetCustomMsg API in C++ or tcim_lite.runtime.Module.get_custom_msg API in Python.

Return Type

None.

Examples

Example 1: Build a model with all default settings (the input is extracted from the quantized model with HMQuantool).

import onnx
import tcim

onnx_model = onnx.load("model.onnx")
tcim.build.build_from_hmonnx(onnx_model, output_name="model")
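A further sketch, not taken from the original examples, shows how non-default parameters and one of the "Other Parameters" might be passed together. The helpers make_build_options and build are hypothetical, and the chosen values (xh2, two cores, O0 with common subgraphs for a quick verification build) are only one possible combination. The call itself follows the form used in Example 1.

```python
def make_build_options(quick: bool = False) -> dict:
    """Hypothetical helper: collect keyword arguments for build_from_hmonnx."""
    options = {
        "output_name": "model",
        "output_dir": "./output",
        "target": "xh2",                       # Houmo M50 devices
        "ncore": 2,                            # supported values: 1, 2, 4
        "opt_level": "O0" if quick else "O2",  # O0 compiles faster
    }
    if quick:
        # Passed through **kwargs; shortens compilation for verification builds.
        options["enable_common_subgraph"] = True
    return options


def build(onnx_path: str, quick: bool = False) -> None:
    # Imported lazily so the options helper can be used without tcim installed.
    import onnx
    import tcim

    onnx_model = onnx.load(onnx_path)
    tcim.build.build_from_hmonnx(onnx_model, **make_build_options(quick))
```

Calling build("model.onnx", quick=True) would then request a fast verification build, while the default requests a fully optimized one.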