set_dummy_tensors

Option.set_dummy_tensors(names)

Sets the names of a model's dummy tensors.

By default, memory for all input and output tensors is allocated when a model is loaded. After this method is called, no memory is allocated at load time for the tensors listed in names.

Note

This function is intended for scenarios where multiple models run sequentially and the output of one model is the input of the next. It reduces memory usage by avoiding redundant allocations.

Memory for every input and output tensor, including the dummy tensors, must be available before model inference. Call tcim_lite.runtime.Module.get_output() on one model and tcim_lite.runtime.Module.set_input() on the other to pass the output of the first model as the input of the second.

Parameters:

names (list(str)) -- Names of the dummy tensors.

Examples

import tcim_lite as tcim

# Create a weight manager
weight_manager = tcim.runtime.WeightManager(0)
# Create an Option object to set configurations for models
option1 = tcim.runtime.Option(weight_manager)
option2 = tcim.runtime.Option(weight_manager)
# Set dummy tensor names
nblocks = 24  # example value: number of transformer blocks; adjust to your model
dummy_tensor_names = [f'model_layers_{i}_self_attn_kcache_input' for i in range(nblocks)]
option2.set_dummy_tensors(dummy_tensor_names)
# Load Qwen models
prefill_part1_model = tcim.runtime.load("qwen2_prefill_part1.hmm", option=option1)
decode_part1_model = tcim.runtime.load("qwen2_decode_part1.hmm", option=option2)
# Pass the k-cache tensor of every block from prefill_part1_model to decode_part1_model
for name in dummy_tensor_names:
    # Get the k-cache tensor from prefill_part1_model
    kcache = prefill_part1_model.get_input(name)
    # Set it as the input of decode_part1_model (no memory was allocated for it at load time)
    decode_part1_model.set_input(name, kcache)
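
The memory saving described above can be illustrated without the tcim_lite runtime. The following is a minimal sketch, not the tcim_lite implementation: ToyModule, allocated_bytes, the tensor names, and the buffer sizes are all hypothetical, and it assumes that set_input binds an existing buffer rather than copying it.

```python
class ToyModule:
    """Hypothetical stand-in for a loaded model; not part of tcim_lite."""

    def __init__(self, tensor_sizes, dummy_names=()):
        dummy = set(dummy_names)
        unknown = dummy - set(tensor_sizes)
        if unknown:
            raise KeyError(f"unknown tensor names: {sorted(unknown)}")
        # Allocate a buffer for every tensor except the dummy ones.
        self.buffers = {
            name: (None if name in dummy else bytearray(size))
            for name, size in tensor_sizes.items()
        }

    def allocated_bytes(self):
        # Total size of the buffers this module currently holds.
        return sum(len(buf) for buf in self.buffers.values() if buf is not None)

    def get_output(self, name):
        return self.buffers[name]

    def set_input(self, name, buffer):
        # Assumption: bind the caller's buffer instead of copying it.
        self.buffers[name] = buffer

    def run(self):
        # Every tensor must have memory before inference.
        missing = [n for n, buf in self.buffers.items() if buf is None]
        if missing:
            raise RuntimeError(f"tensors without memory: {missing}")


# Hypothetical tensor layout: two cache tensors and one result tensor.
cache_names = [f"kcache_{i}" for i in range(2)]
sizes = {name: 1024 for name in cache_names}
sizes["logits"] = 256

first = ToyModule(sizes)                            # allocates everything
second = ToyModule(sizes, dummy_names=cache_names)  # skips the cache tensors
bytes_at_load = (first.allocated_bytes(), second.allocated_bytes())
print(bytes_at_load)

# Bind the first model's cache buffers to the second model before running it.
for name in cache_names:
    second.set_input(name, first.get_output(name))
second.run()
```

In this sketch, calling second.run() before the set_input loop raises RuntimeError, which mirrors the requirement that dummy tensors receive memory before inference.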