set_dummy_tensors
- Option.set_dummy_tensors(names)
Specifies the names of the model tensors to be treated as dummy tensors.
By default, memory for input and output tensors is allocated during model loading. If this method is called, no memory is allocated for the named tensors when the model is loaded.
Note
This function is intended for scenarios where multiple models run inference sequentially, with the output of one model used as the input of another model. It reduces memory usage by avoiding unnecessary memory allocation.
The memory for input and output tensors must be allocated before model inference. You can call tcim_lite.runtime.Module.get_output() and tcim_lite.runtime.Module.set_input() to set the output of one model as the input of another model.
- Parameters:
names (list(str)) -- Name list of the dummy tensors.
Examples
    import tcim_lite as tcim

    # Create a weight manager
    weight_manager = tcim.runtime.WeightManager(0)

    # Create Option objects to set configurations for the models
    option1 = tcim.runtime.Option(weight_manager)
    option2 = tcim.runtime.Option(weight_manager)

    # Set dummy tensor names
    # (nblocks is the number of model layers and must be defined by the caller)
    dummy_tensor_names = [f'model_layers_{i}_self_attn_kcache_input' for i in range(nblocks)]
    option2.set_dummy_tensors(dummy_tensor_names)

    # Load Qwen models
    prefill_part1_model = tcim.runtime.load("qwen2_prefill_part1.hmm", option=option1)
    decode_part1_model = tcim.runtime.load("qwen2_decode_part1.hmm", option=option2)

    for name in dummy_tensor_names:
        # Get the kcache tensor from prefill_part1_model
        kcache = prefill_part1_model.get_input(name)
        # Set it as the input of decode_part1_model
        decode_part1_model.set_input(name, kcache)
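The memory behaviour described in this section can also be illustrated without the runtime. The following is a minimal sketch in plain Python and is not the tcim_lite implementation: ToyModule, its methods, the tensor names, and the byte sizes are all hypothetical and exist only to show the idea. A module that marks tensors as dummy skips their allocation at load time, and those tensors must be bound to an existing buffer before inference.

```python
class ToyModule:
    """Hypothetical stand-in for a loaded model (not the tcim_lite API)."""

    def __init__(self, tensor_sizes, dummy_names=()):
        # tensor_sizes maps a tensor name to its size in bytes.
        self.tensor_sizes = dict(tensor_sizes)
        dummy = set(dummy_names)
        unknown = dummy - self.tensor_sizes.keys()
        if unknown:
            raise KeyError(f"unknown tensor names: {sorted(unknown)}")
        # "Loading": allocate a buffer for every tensor that is not a dummy.
        self.buffers = {
            name: (None if name in dummy else bytearray(size))
            for name, size in self.tensor_sizes.items()
        }
        # Bytes this module allocated itself at load time.
        self.owned_bytes = sum(
            len(buf) for buf in self.buffers.values() if buf is not None
        )

    def get_output(self, name):
        # Return the buffer currently backing the named tensor.
        return self.buffers[name]

    def set_input(self, name, buffer):
        # Bind an existing buffer to the named tensor without copying it.
        if len(buffer) != self.tensor_sizes[name]:
            raise ValueError(f"size mismatch for tensor {name!r}")
        self.buffers[name] = buffer

    def run(self):
        # Every tensor needs backing memory before inference can start.
        unbound = [name for name, buf in self.buffers.items() if buf is None]
        if unbound:
            raise RuntimeError(f"tensors without memory: {unbound}")
        return True


# Hypothetical sizes: two cache tensors of 1024 bytes and one 256-byte tensor.
nblocks = 2
kcache_names = [f"kcache_{i}" for i in range(nblocks)]
sizes = {name: 1024 for name in kcache_names}
sizes["logits"] = 256

producer = ToyModule(sizes)                            # allocates everything
consumer = ToyModule(sizes, dummy_names=kcache_names)  # skips the cache tensors

print(producer.owned_bytes)  # 2304
print(consumer.owned_bytes)  # 256

# Reuse the producer's buffers as the consumer's inputs, then run.
for name in kcache_names:
    consumer.set_input(name, producer.get_output(name))
consumer.run()
```

In this sketch the consumer never allocates the cache tensors itself, so the two modules together hold one copy of each cache buffer instead of two. Calling run() on a module whose dummy tensors have not been bound raises an error, which mirrors the requirement that tensor memory be in place before inference.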