4.3. LLM Inference
TCIM supports deploying LLMs (Large Language Models) to Houmo hardware devices, including models such as Qwen3-14B and DeepSeek.
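The Qwen model consists of the following parts:
Prefill model: processes all input tokens, generates the corresponding KV Cache, and predicts the first output token.
Decode model: iteratively feeds each predicted output token back into the model, generates the corresponding KV Cache, and uses the result of the current Decode step to predict the next output token.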
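The following uses the Qwen model as an example to show how to run inference in a single-batch scenario. The examples show only the code for the key steps; they are for reference and cannot be copied and run as-is. The sample code is available in the following locations:
(Linux only) The houmo-examples_<release>/houmo-examples-xh2/models/llm directory of the development sample package.
(Linux and Windows) The houmo-examples_<release>/houmo-examples-xh2/apis/inferences directory of the development sample package.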
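4.3.1. Model Inference Steps
The main Qwen inference flow is as follows:
The user enters text as the query.
In the tokenizer stage, the text is converted into numeric input that the model can process.
In the Prefill stage, the model runs inference on the input text and generates the first output token.
In the Decode stage, the model outputs tokens one at a time to produce the final output text (response).
Figure 4.14 Main inference flow of the Qwen model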
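The flow above can be sketched with a toy stand-in for the model. This is only an illustration of the tokenizer, Prefill, and Decode stages: VOCAB, toy_tokenize, toy_next_token, and generate are hypothetical names that belong to neither TCIM nor the Qwen samples, and the "model" is a trivial rule rather than a neural network.

```python
# Toy vocabulary; id 0 plays the role of the end-of-sequence token.
VOCAB = ["<eos>", "hello", "world", "again"]


def toy_tokenize(text):
    """Tokenizer stage: map whitespace-separated words to token ids."""
    return [VOCAB.index(word) for word in text.split()]


def toy_next_token(context_ids):
    """Hypothetical 'model': predict the id after the last one, else <eos>."""
    candidate = context_ids[-1] + 1
    return candidate if candidate < len(VOCAB) else 0


def generate(query, max_context=8):
    ids = toy_tokenize(query)
    # Prefill stage: consume the whole prompt and predict the first token.
    token = toy_next_token(ids)
    output = []
    # Decode stage: emit one token per step until <eos> or the context is full.
    while token != 0 and len(ids) + len(output) < max_context:
        output.append(token)
        token = toy_next_token(ids + output)
    return " ".join(VOCAB[t] for t in output)


print(generate("hello"))
```

The loop has the same two exit conditions as the real Decode stage described later in this section: an end-of-sequence token, or a full context.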
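Model deployment mainly uses the PyTorch API and the TCIM Python API. The TCIM Python API is used mainly for model inference. The main steps are as follows:
Note
When importing external libraries, import PyTorch first (import torch) and then TCIM (import tcim_lite as tcim); importing them in the other order causes an error.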
Create the Weight Manager. Because the Qwen model is large, a weight manager must be set when loading the model so that the device memory holding the weight values is shared.
For multi-card scenarios, specify the logical IDs of multiple Houmo devices through tcim.runtime.DevManager. Example:
    self.ndevice = ndevice
    if self.ndevice == 1:
        # Single card
        weight_manager = tcim.runtime.WeightManager(0)
    elif self.ndevice == 2:
        # Two cards: pass their logical device IDs through DevManager
        dev_manager = tcim.runtime.DevManager([1, 0], "Xh2HalBackend")
        weight_manager = tcim.runtime.WeightManager(dev_manager)
    else:
        raise ValueError("Unsupported device number!")
Create the Option objects and set the model loading parameters. Example:
    import re

    option1 = tcim.runtime.Option(weight_manager)
    option2 = tcim.runtime.Option(weight_manager)
    # Count the number of model blocks/layers.
    input_names = []
    for i in range(self.prefill.get_num_inputs()):
        input_names.append(self.prefill.get_input_name(i))
    pattern = r'^model_layers_(\d+)_self_attn_kcache_input$'
    count = sum(1 for item in input_names if re.match(pattern, item))
    nblocks = count
    self.nblocks = nblocks  # reused later when setting the KV cache inputs
    # Register the K cache and V cache inputs of every block as dummy tensors
    dummy_tensor_names = [f'model_layers_{i}_self_attn_kcache_input' for i in range(nblocks)]
    dummy_tensor_names += [f'model_layers_{i}_self_attn_vcache_input' for i in range(nblocks)]
    option2.set_dummy_tensors(dummy_tensor_names)
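The name-matching logic in this step can be tried without any hardware. The sketch below applies the same regular expression to a hand-written list of input names. The function kv_cache_input_names and the first three entries of example_inputs are assumptions made for illustration; only the kcache/vcache naming pattern comes from the example above.

```python
import re

KCACHE_PATTERN = re.compile(r"^model_layers_(\d+)_self_attn_kcache_input$")


def kv_cache_input_names(input_names):
    """Count blocks by their K cache inputs and rebuild the K/V cache names."""
    nblocks = sum(1 for name in input_names if KCACHE_PATTERN.match(name))
    names = [f"model_layers_{i}_self_attn_kcache_input" for i in range(nblocks)]
    names += [f"model_layers_{i}_self_attn_vcache_input" for i in range(nblocks)]
    return nblocks, names


# Hypothetical input names of a two-block model (the first three are made up).
example_inputs = [
    "inputs_embeds",
    "valid_length",
    "current_length",
    "model_layers_0_self_attn_kcache_input",
    "model_layers_0_self_attn_vcache_input",
    "model_layers_1_self_attn_kcache_input",
    "model_layers_1_self_attn_vcache_input",
]

nblocks, dummy_tensor_names = kv_cache_input_names(example_inputs)
print(nblocks, dummy_tensor_names)
```

Only the K cache names are counted, so each block contributes exactly one match regardless of how many other inputs the model has.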
Load the compiled model files. The stream used for inference is created and set automatically. The models must be loaded in the order shown below:
    # Load the model files for the Prefill and Decode models accordingly
    self.prefill_model = tcim.runtime.load(os.path.join(model_dir, "qwen3_prefill.hmm"), option=option1)
    self.decode_model = tcim.runtime.load(os.path.join(model_dir, "qwen3_decode.hmm"), option=option2)
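Obtain the key input parameters needed for inference. Dimension indices below are zero-based.
prefill_length: the total number of tokens that each Prefill iteration can process. Taken from dimension 1 of the Prefill model's first input tensor.
embedding_len: the embedding vector dimension of an input token. Taken from dimension 2 of the Prefill model's first input tensor.
decode_length: the maximum context length that the Decode stage can process. In the example code, it is taken from dimension 2 of the Decode model's input tensor at index 3.
batch: the batch size supported by the Decode model. Taken from dimension 0 of the shape of the Decode model's first input tensor.
Example:
    self.prefill_length = self.prefill.get_input_info(
        self.prefill.get_input_name(0)
    ).shape[1]
    self.embedding_len = self.prefill.get_input_info(
        self.prefill.get_input_name(0)
    ).shape[2]
    self.decode_length = self.decode_model.get_input_info(
        self.decode_model.get_input_name(3)
    ).shape[2]
    self.batch = self.decode_model.get_input_info(
        self.decode_model.get_input_name(0)
    ).shape[0]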
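Initialize the context cache (K/V cache) for the Decode stage. Example:
    # Set the KV cache inputs: inputs 3 .. 2 * nblocks + 2 of the Prefill model
    # are passed to the Decode model inputs at the same indices
    for i in range(3, 2 * self.nblocks + 3):
        cache = self.prefill.get_input(self.prefill.get_input_name(i))
        self.decode.set_input(self.decode.get_input_name(i), cache)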
Load the weight file and the tokenizer model files. Example:
    from transformers import AutoTokenizer

    # Load a pre-trained tokenizer model
    TOKENIZER_PATH = "qwen3-8b"
    self.tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
    # Load the embedding weight
    EMBEDDING_PATH = os.path.join('output', HOUMO_TARGET, 'hmquant', 'quant_embedding.pt')
    embedding_weight = torch.load(EMBEDDING_PATH, map_location="cpu", weights_only=True)['weight']
    # Reshape to (vocabulary size, embedding_len)
    self.embedding_weight = embedding_weight.reshape(-1, self.embedding_len)
Tokenize: convert the text into numeric input that the model can process. If the input length reaches or exceeds the context length that Decode can process (decode_length), report an error. Example:
    # The input message for the model
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
    ]
    # Apply the chat template to the input message
    text = self.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )
    # Tokenize the input text and return the result as tensors
    inputs = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
    # Decode token IDs back to text
    text = self.tokenizer.batch_decode(inputs.input_ids)[0]
    # Extract the token IDs from the tokenized input
    all_input_ids = inputs["input_ids"]
    # The length of the input sequence
    input_echo_len = all_input_ids.numel()
    # Ensure the input length stays below the maximum context length
    if input_echo_len >= self.decode_length:
        logger.error(f"Question is longer than {self.decode_length}, please shorten it!")
        return f"Question is longer than {self.decode_length}, please shorten it!"
Prefill: interpret the input text and generate the first token.
If the input to the Prefill model contains too many tokens, split the input tokens into segments. The maximum context length of each Prefill iteration is set by the prefill_length parameter, and the maximum context length that Decode can process is set by the decode_length parameter. Example code:
    # The number of Prefill rounds needed to process the input
    prefill_loop_round = math.ceil(input_echo_len / self.prefill_length)
    # Loop through the input in chunks of at most prefill_length tokens
    for round_idx in range(prefill_loop_round):
        start = round_idx * self.prefill_length
        # Number of tokens already processed in earlier rounds
        valid_length = start
        if round_idx == prefill_loop_round - 1:
            # Last round: take the remaining tokens
            current_length = input_echo_len - start
            input_ids = all_input_ids[:, start:input_echo_len]
        else:
            current_length = self.prefill_length
            input_ids = all_input_ids[:, start:start + self.prefill_length]
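To see how the segmentation arithmetic behaves, the sketch below computes the (valid_length, current_length) pair for every Prefill round using plain integers. The function name prefill_chunks is hypothetical; the arithmetic mirrors the example above under the assumption that valid_length is the number of tokens consumed by earlier rounds.

```python
import math


def prefill_chunks(input_len, prefill_length):
    """Return one (valid_length, current_length) pair per Prefill round."""
    rounds = math.ceil(input_len / prefill_length)
    chunks = []
    for round_idx in range(rounds):
        start = round_idx * prefill_length
        end = min(start + prefill_length, input_len)
        chunks.append((start, end - start))
    return chunks


# A 10-token prompt with prefill_length = 4 needs three rounds.
print(prefill_chunks(10, 4))
```

Only the last round can be shorter than prefill_length, which is why that round's input is padded before inference.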
After segmentation, do the following for each segment of input tokens:
Create the input embeddings and pad them. Embedding is the process of mapping tokens to vectors. Example code:
    import torch.nn.functional as F

    # Get the embeddings for the input tokens
    inputs_embeds = F.embedding(input_ids, self.embedding_weight)
    # Get the effective length of the current input sequence
    effective_length = input_ids.size(-1)
    # Create zero embeddings to pad the input sequence to prefill_length
    _pad_embeds = torch.zeros(
        1,
        self.prefill_length - effective_length,
        inputs_embeds.size(-1),
        dtype=inputs_embeds.dtype,
        device=inputs_embeds.device,
    )
    input_data = torch.cat([inputs_embeds, _pad_embeds], dim=1).reshape(
        1, self.prefill_length, self.embedding_len
    )
    valid_length_data = np.array([valid_length]).astype("int32")
    current_length_data = np.array([current_length]).astype("int32")
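The lookup-and-pad idea can be shown without PyTorch. The sketch below treats the embedding table as a list of rows, selects one row per token id, and appends zero rows up to the fixed segment length. embed_and_pad and the tiny table are hypothetical; the real code works on tensors through F.embedding and torch.zeros.

```python
def embed_and_pad(token_ids, embedding_table, prefill_length):
    """Look up one embedding row per token, then zero-pad to prefill_length rows."""
    if len(token_ids) > prefill_length:
        raise ValueError("segment is longer than prefill_length")
    dim = len(embedding_table[0])
    # Embedding lookup: row i of the table is the vector for token id i.
    rows = [list(embedding_table[token_id]) for token_id in token_ids]
    # Padding: zero vectors fill the unused positions of the segment.
    rows += [[0.0] * dim for _ in range(prefill_length - len(token_ids))]
    return rows


# Hypothetical table: 3 tokens, embedding dimension 2.
table = [[0.1, 0.2], [1.0, 1.1], [2.0, 2.1]]
print(embed_and_pad([2, 0], table, 4))
```

The padded rows carry no information; in the example above, current_length is the value that records how many positions of the segment hold real tokens.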
Run inference on the Prefill model for each segment of input tokens.
Example code:
    # Set input for inference
    self.prefill_model.set_input(self.prefill.get_input_name(0), input_data.numpy())
    self.prefill_model.set_input(self.prefill.get_input_name(1), valid_length_data)
    self.prefill_model.set_input(self.prefill.get_input_name(2), current_length_data)
    # Infer the model
    self.prefill_model.run()
    # Synchronize the model to ensure all operations are complete
    self.prefill_model.sync()
Decode the Prefill inference output into text to obtain the next token in the sequence, then convert that token into its vector embedding for further processing. Example code:
    # Get the output of Prefill inference
    input_data = self.prefill_model.get_output(self.prefill.get_output_name(0)).numpy()
    # Pick the most likely next token id and decode it to get the Prefill response text
    next_id = input_data.argmax(-1)[0]
    prefill_response = self.tokenizer.decode(next_id)
    # Extract the chat history token IDs for tracking conversation context
    chat_history_ids = all_input_ids[0]
    # Convert next_id to a PyTorch tensor
    next_id = torch.from_numpy(next_id)
    # Append the new token ID to the chat history
    chat_history_ids = torch.cat([chat_history_ids, next_id], dim=-1)
    # Get the embedding of the new token as the first Decode input
    input_data = F.embedding(next_id.unsqueeze(0), self.embedding_weight).reshape(1, 1, -1)
    all_response = prefill_response
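The argmax call in this step is greedy token selection: the next token is the vocabulary entry with the highest score. A minimal sketch in plain Python follows, assuming the model output is a list of per-token scores; greedy_next_id and the sample scores are hypothetical.

```python
def greedy_next_id(scores):
    """Return the index of the highest score (the first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores over a 4-token vocabulary.
toy_vocab = ["<eos>", "yes", "no", "maybe"]
scores = [0.1, 2.5, -1.0, 2.5]
next_id = greedy_next_id(scores)
print(next_id, toy_vocab[next_id])
```

Greedy selection is deterministic, so the same prompt always yields the same response; sampling strategies would replace this one function.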
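Decode: output tokens one at a time.
Set the Decode model inputs.
Loop over the output tokens until generation finishes or the maximum context length (decode_length) is reached:
Run inference on the Decode model.
Decode the model's inference output into text, generate the next token in the sequence, and, following a sliding-window strategy, consider only the most recent slide_len tokens (that is, the latest context). The generated token is converted into its vector embedding and used as the input to the next inference step.
slide_len is the size of the sliding window. It determines the length of the history context used when the next token is generated. In the example code, slide_len is applied when the most recent tokens in chat_history_ids are decoded back to text. Choose its value based on conversation relevance, compute resources, and text coherence. For short conversations with modest context requirements, a small value such as 10 is sufficient; for long conversations, or to keep longer generated text coherent, use a larger value such as 20 to 50. A larger slide_len increases compute overhead, so balance compute resources against conversation quality.
Output the full text (all_response). The full output text consists of the Prefill output followed by the Decode output.
Example: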
    context_length = input_echo_len
    # Counter for tokens whose text has not been emitted yet
    skip_tokens = 0
    # Sliding window length for tracking the last part of the response
    slide_len = 10
    # Text produced by the latest step; initialized so the EOS branch can read it safely
    decode_response = ""
    # Decode the last slide_len tokens from the chat history
    last_response = self.tokenizer.decode(chat_history_ids.tolist()[-slide_len:])

    # (1) Set input for the Decode model
    # Each Decode step processes exactly one token
    current_length_input_1 = np.array([1]).astype("int32")
    decode_current_length_name = self.decode.get_input_name(2)
    self.decode_model.set_input(decode_current_length_name, current_length_input_1)

    # (2) Iterate to generate output tokens
    while True:
        # Stop when the maximum context length is reached
        if context_length >= self.decode_length:
            logger.info(f"context length greater than {self.decode_length}, break!")
            break
        # Set input of the Decode model
        self.decode_model.set_input(self.decode.get_input_name(0), input_data.numpy())
        valid_length_data = np.array(context_length).astype("int32")
        self.decode_model.set_input(self.decode.get_input_name(1), valid_length_data)

        # a. Infer the Decode model
        self.decode_model.run()
        self.decode_model.sync()

        # b. Get the output of Decode inference
        input_data = self.decode_model.get_output(self.decode.get_output_name(0)).numpy()
        # Pick the most likely next token id
        next_id = input_data.astype(np.float32).argmax(-1)[0]
        # Convert the next_id to a PyTorch tensor
        next_id = torch.from_numpy(next_id)
        # If the model outputs the end-of-sequence token, stop decoding
        if next_id == self.tokenizer.eos_token_id:
            # Flush text that was held back as incomplete; text already printed is not repeated
            if skip_tokens > 0:
                print(decode_response, end="", flush=True)
                all_response += decode_response
            break
        # Append the newly generated token ID to the chat history
        chat_history_ids = torch.cat([chat_history_ids, next_id], dim=-1)
        # Decode the last part of the response, considering the sliding window and skipped tokens
        decode_response = self.tokenizer.decode(
            chat_history_ids.tolist()[-(slide_len + 1) - skip_tokens:]
        )[len(last_response):]
        # If the decoded response is non-empty and ends with a valid character
        # (is_valid_char is a helper that is not shown in this excerpt)
        if decode_response != '' and is_valid_char(ord(decode_response[-1])):
            # Print the decoded response in real time
            print(decode_response, end="", flush=True)
            # Append to the accumulated response
            all_response += decode_response
            # Update last response
            last_response = self.tokenizer.decode(chat_history_ids.tolist()[-slide_len:])
            # Reset skipped token count
            skip_tokens = 0
        else:
            # Increment the skip token counter if the response is incomplete
            skip_tokens += 1
        # Convert next_id to an embedding tensor for the next decoding iteration
        input_data = F.embedding(next_id.unsqueeze(0), self.embedding_weight).reshape(1, 1, -1)
        context_length = context_length + 1
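The skip_tokens bookkeeping exists because one character can span several tokens, so text decoded from a partial token sequence may end in a broken character. The sketch below reproduces that logic with a toy byte-level tokenizer in which every token is one UTF-8 byte. toy_decode and stream_text are hypothetical, and checking for the Unicode replacement character stands in for the is_valid_char helper, whose actual definition is not shown in this section.

```python
REPLACEMENT = "\ufffd"  # produced when a byte sequence is incomplete or invalid


def toy_decode(ids):
    """Hypothetical byte-level detokenizer: each token id is one UTF-8 byte."""
    return bytes(ids).decode("utf-8", errors="replace")


def stream_text(prompt_ids, new_ids, slide_len=10):
    """Return the text pieces emitted as tokens arrive one at a time."""
    history = list(prompt_ids)
    last_response = toy_decode(history[-slide_len:])
    skip_tokens = 0
    pieces = []
    for token in new_ids:
        history.append(token)
        # Re-decode the window, widened by the tokens still held back.
        window = history[-(slide_len + 1) - skip_tokens:]
        piece = toy_decode(window)[len(last_response):]
        if piece and not piece.endswith(REPLACEMENT):
            pieces.append(piece)  # complete text: emit it
            last_response = toy_decode(history[-slide_len:])
            skip_tokens = 0
        else:
            skip_tokens += 1  # incomplete character: wait for more tokens
    return pieces


prompt = list("Hi ".encode("utf-8"))
generated = list("你好!".encode("utf-8"))  # 3 bytes + 3 bytes + 1 byte
print(stream_text(prompt, generated))
```

Each Chinese character is emitted only after its third byte arrives, which matches how the example above delays printing until the decoded text ends in a valid character.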
For the complete sample code, see the houmo-examples_<release>/houmo-examples/models/llm/qwen3 directory of the development sample package.