tcim.builder.api
build_from_hmonnx
build_from_hmonnx(onnx_model: Union[str, onnx.ModelProto],
output_name: str = "model",
output_dir: str = "./output",
work_dir: Union[str, None] = None,
target: str = "xh1",
ncore: int = 1,
opt_level: str = "O2",
**kwargs: Union[dict, None],)
Builds a TCIM model.
Parameters
onnx_model (str or onnx.ModelProto)--The ONNX model object to be built.
output_name (str)--The name of the binary model file (.hmm) generated after model compilation. Defaults to "model".
output_dir (str)--The directory where the generated binary model file is saved. Defaults to "./output".
work_dir (Optional[str])--The directory for saving intermediate files generated during compilation. Defaults to ./output/workspace.
target (Optional[str])--The Houmo device hardware used for model inference. Defaults to "xh1". Supported values are:
xh1: Represents Houmo M30 or H30 devices.
xh2: Represents Houmo M50 devices.
ncore (int)--The number of IPU cores used for model inference. Supported values are 1, 2, and 4. Defaults to 1.
opt_level (str)--The optimization level applied during compilation. Supported values are O0, O1, and O2. Defaults to O2. Higher optimization levels, such as O2, provide better performance and improved runtime efficiency, but result in longer compilation times. For faster compilation, typically used for testing, set this parameter to O0.
kwargs (Optional[dict])--Additional configuration options. See the "Other Parameters" section for details.
Other Parameters
weights (Optional[str, numpy.NDArray])--The input weights for large models (size of onnx_model >= 2 GB), where the weights and the ONNX model are stored separately. Defaults to None.
batch (int)--The batch size for model inference and dynamic image resizing. Defaults to 1.
If the parameter is used only for model compilation and inference, it represents the batch size for model inference:
If set to 1, the input shape and batch size of the ONNX model are used.
If set to a value greater than 1, the final batch size is calculated as: final_batch_size = (value of this parameter) x (batch size of the original ONNX model)
The first dimension of the original ONNX input shape is treated as the batch (N) dimension. If the first dimension is not the batch dimension, an error occurs.
If the dynamic image resizing function is used:
If one_img_multi_roi is set to True, batch indicates the number of ROIs (Regions of Interest) to be resized from a single input image.
If one_img_multi_roi is set to False, each image is resized into a single ROI based on its individual ROI configuration, and batch indicates both the number of input images and the number of ROIs.
Attention
This parameter applies to non-LLM models only. For LLM models, use modify_llm.batch to configure the batch size.
one_img_multi_roi (bool)--Specifies whether to enable dynamic resizing of a single input image into multiple ROIs. Defaults to False. When set to True, an input image is resized into multiple batches, and each ROI is resized independently. The number of ROIs is determined by the value of the batch parameter.
modify_llm (dict)--Dictionary that customizes the batch size, context length, and input sequence length of an LLM during compilation. In the original model, these parameters are fixed and cannot be changed during inference. With this parameter, the compiled model can support user-defined values for them. The dictionary includes the following keys:
batch (int)--Batch size for the decode model. Supported values are 1, 2, 4, and 8.
context-length (int)--The maximum context length for both prefill and decode models, in tokens. Supported values are 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768.
fill-length (int)--The maximum sequence length handled in a single forward pass in the prefill phase, in tokens. Supported values are 128, 256, and 512.
ndevice (int)--The number of devices used for model inference. Supported values are 1 and 2. Defaults to 1.
io_layout (str)--The layout used for the model inputs and outputs. A string value applies to all inputs and outputs and must be "stdv1" or "any". Defaults to "stdv1".
"stdv1": All inputs and outputs can be used as inputs or outputs of other models on the device side. This improves the efficiency of the pipeline but reduces the performance of this model.
"any": The inputs and outputs cannot be used as inputs or outputs of other models on the device side, but the performance of this model improves.
enable_profile (bool)--Specifies whether to enable the Profiler tool for performance analysis. Defaults to False. When set to True, the Profiler tool records and analyzes performance data for instructions executed on the IPU cores of the Houmo device. For more information, see "Profiler Tool User Guide".
enable_dynamic_image_resize (bool)--Specifies whether to enable dynamic resizing for a single input image. Defaults to False. When set to True, the generated binary model file includes a new input tensor, dyn_info. Before model inference, set the value of each element of the dyn_info tensor to specify the resizing configuration. This parameter is only supported when the batch size is 1.
The data type of this tensor is int32, and its shape is [10]. The elements of this tensor are defined as [cropY, cropX, crop_height, crop_width, resize_height, resize_width, pad_top, pad_left, pad_bottom, pad_right], where:
cropY: Vertical coordinate (in pixels) of the top-left corner of the cropping region.
cropX: Horizontal coordinate (in pixels) of the top-left corner of the cropping region.
crop_height: The height (in pixels) of the cropping region.
crop_width: The width (in pixels) of the cropping region.
resize_height: Target height (in pixels) after resizing.
resize_width: Target width (in pixels) after resizing.
pad_top: Vertical padding (in pixels) applied to the top of the image.
pad_left: Horizontal padding (in pixels) applied to the left side of the image.
pad_bottom: Vertical padding (in pixels) applied to the bottom of the image.
pad_right: Horizontal padding (in pixels) applied to the right side of the image.
Note: Make sure the following constraints are met:
dstH = outH - pad_top - pad_bottom, where outH is the original input height.
dstW = outW - pad_left - pad_right, where outW is the original input width.
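As a rough illustration of the dyn_info layout and the constraints in the note above, the following sketch assembles the ten values in the documented order. The helper name make_dyn_info is hypothetical and not part of TCIM. The sketch also assumes that dstH and dstW refer to resize_height and resize_width, and that outH and outW are the input height and width of the model; neither is stated explicitly in this reference.

```python
# Element order of dyn_info as documented above.
DYN_INFO_FIELDS = (
    "cropY", "cropX", "crop_height", "crop_width",
    "resize_height", "resize_width",
    "pad_top", "pad_left", "pad_bottom", "pad_right",
)


def make_dyn_info(out_h, out_w, crop, resize, pad):
    """Return the 10 dyn_info values as a list of ints (hypothetical helper).

    crop   = (cropY, cropX, crop_height, crop_width)
    resize = (resize_height, resize_width)
    pad    = (pad_top, pad_left, pad_bottom, pad_right)
    """
    crop_y, crop_x, crop_h, crop_w = crop
    resize_h, resize_w = resize
    pad_top, pad_left, pad_bottom, pad_right = pad

    # Assumption: dstH/dstW in the note are the resized height/width.
    if resize_h != out_h - pad_top - pad_bottom:
        raise ValueError("resize_height must equal outH - pad_top - pad_bottom")
    if resize_w != out_w - pad_left - pad_right:
        raise ValueError("resize_width must equal outW - pad_left - pad_right")

    values = [crop_y, crop_x, crop_h, crop_w,
              resize_h, resize_w,
              pad_top, pad_left, pad_bottom, pad_right]
    return [int(v) for v in values]


# Crop a full 1080x1920 frame, resize it to 360x640, and pad it
# vertically to fill a 640x640 model input.
dyn_info = make_dyn_info(
    640, 640,
    crop=(0, 0, 1080, 1920),
    resize=(360, 640),
    pad=(140, 0, 140, 0),
)
```

Before inference, the list would still need to be converted to an int32 tensor of shape [10], for example with numpy.asarray(dyn_info, dtype=numpy.int32).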
llm_opt (bool)--Enables performance optimization for LLM (Large Language Model) inference. Defaults to False. Set to True only when performing inference with LLM models.
j (int)--The number of CPU cores used for model compilation. Defaults to using all available CPU cores.
enable_common_subgraph (bool)--Enables outlining of repeated computation blocks as common subgraphs during compilation. Defaults to False. When set to True, this parameter can significantly reduce compilation time, particularly for Transformer-based models, making it suitable for quick verification builds. However, enabling it may slightly reduce runtime performance.
subgraph_repeat_hint (int)--Specifies a reference value for the number of repeated blocks in the model, which improves the quality of generated subgraphs when enable_common_subgraph is enabled. Defaults to 20. For best results, set this parameter to match the actual number of repeated blocks in the model. Values smaller than the actual number of repeated blocks will not be folded into subgraphs, and values exceeding the actual number are invalid.
enable_xh2_stable_output (bool)--Ensures that inference outputs on M50 products remain consistent across multiple runs of the same model. Defaults to False. Set to False for LLM models.
True: Guarantees bit-accurate, consistent results across multiple runs of the same model. Note that this may cause a slight decrease in inference performance.
False (default): Provides maximum performance and faster execution, but outputs may exhibit minor random fluctuations due to the nature of parallel floating-point calculations.
flash_attention (int)--Specifies whether to use the Flash Attention optimization strategy. This is an experimental feature supported exclusively on XH2 hardware. Defaults to 0. Supported values are:
0: Disable Flash Attention.
1: Enable Flash Attention at the computation graph level. This mode is currently not recommended for general use.
2: Enable Flash Attention internally within the attention operator. This mode is specifically designed for LLMs to enhance performance when the context length is greater than 2048 tokens.
device_kernel_split (int)--Specifies the number of sub-kernels into which a long-running kernel is split. This parameter is used when multiple models run concurrently on multiple devices, and it prevents a "heavy" model from blocking the runtime scheduler. For example, if a large model has a kernel that runs for 100 ms and a small model has a kernel that runs for 10 ms, setting this value to 10 splits the large kernel into smaller segments. This resolves head-of-line blocking, so latency-sensitive tasks such as ASR do not have to wait. This parameter takes effect only in multi-device environments. Defaults to 1.
custom_msg (str)--Custom metadata to be embedded into the generated binary model file (.hmm) after model compilation. Defaults to an empty string. This information can later be retrieved using the Module::GetCustomMsg API in C++ or the tcim_lite.runtime.Module.get_custom_msg API in Python.
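The supported values listed for the modify_llm keys can be checked before a long compilation starts. The sketch below is one possible way to do that; make_modify_llm is a hypothetical helper, not part of TCIM, and it only encodes the value lists given in this reference.

```python
# Supported values copied from the modify_llm description above.
SUPPORTED_LLM_BATCH = (1, 2, 4, 8)
SUPPORTED_CONTEXT_LENGTH = (256, 512, 1024, 2048, 4096, 8192, 16384, 32768)
SUPPORTED_FILL_LENGTH = (128, 256, 512)


def make_modify_llm(batch, context_length, fill_length):
    """Validate the three values and return a modify_llm dictionary.

    Hypothetical helper: it raises ValueError for any value that is not
    in the documented lists.
    """
    if batch not in SUPPORTED_LLM_BATCH:
        raise ValueError(f"batch must be one of {SUPPORTED_LLM_BATCH}")
    if context_length not in SUPPORTED_CONTEXT_LENGTH:
        raise ValueError(f"context-length must be one of {SUPPORTED_CONTEXT_LENGTH}")
    if fill_length not in SUPPORTED_FILL_LENGTH:
        raise ValueError(f"fill-length must be one of {SUPPORTED_FILL_LENGTH}")
    # The key names use hyphens, as documented.
    return {
        "batch": batch,
        "context-length": context_length,
        "fill-length": fill_length,
    }


modify_llm = make_modify_llm(batch=4, context_length=4096, fill_length=256)
```

The resulting dictionary would then be passed as the modify_llm keyword argument of build_from_hmonnx, typically together with llm_opt=True for LLM models.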
Return Type
None.
Examples
Example 1: Build a model with all default settings (the input is extracted from the quantized model with HMQuantool).
import onnx
import tcim
onnx_model = onnx.load("model.onnx")
tcim.build.build_from_hmonnx(onnx_model, output_name="model")
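A build with non-default settings might look like the following sketch, which targets an M50 device with four IPU cores and uses O0 for a quick test compilation. The helpers make_build_options and build are hypothetical wrappers written for this illustration; only the option names and supported values come from the parameter descriptions above, and the call path mirrors the one used in the example above.

```python
def make_build_options(target="xh1", ncore=1, opt_level="O2", **extra):
    """Check the documented core options and return them as a dict.

    Hypothetical helper; extra keyword arguments (for example batch)
    are passed through unchecked.
    """
    if target not in ("xh1", "xh2"):
        raise ValueError("target must be 'xh1' or 'xh2'")
    if ncore not in (1, 2, 4):
        raise ValueError("ncore must be 1, 2, or 4")
    if opt_level not in ("O0", "O1", "O2"):
        raise ValueError("opt_level must be 'O0', 'O1', or 'O2'")
    options = {"target": target, "ncore": ncore, "opt_level": opt_level}
    options.update(extra)
    return options


def build(onnx_path, output_name, options):
    """Load the ONNX file and compile it with the given options."""
    # Imported lazily so that make_build_options can be used on a
    # machine without the toolchain installed.
    import onnx
    import tcim

    onnx_model = onnx.load(onnx_path)
    tcim.build.build_from_hmonnx(onnx_model, output_name=output_name, **options)


# M50 target, 4 IPU cores, fastest compilation, batch multiplier of 2.
options = make_build_options(target="xh2", ncore=4, opt_level="O0", batch=2)
# build("model.onnx", "model_m50", options)  # requires onnx and tcim
```

With batch=2, the final batch size is twice the batch size of the original ONNX model, as described for the batch parameter.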