Video Tutorial: Build Along: Run LLMs Locally on Qualcomm Hardware Using ExecuTorch
This file provides instructions for running LLM decoder models, vision-language models (VLMs), and audio-language models (ALMs) with different parameters via the Qualcomm HTP backend. We currently support the following models:
- Large language models
  - LLAMA2 Stories 110M
  - LLAMA3.2 1B
  - LLAMA3.2 3B
  - Codegen2 1B
  - Gemma 2B
  - Gemma2 2B
  - Gemma3 1B
  - GLM 1.5B
  - Granite3.3 2B
  - Phi4-mini-instruct
  - QWEN2.5 0.5B / 1.5B
  - QWEN3 0.6B / 1.7B
  - SmolLM2 135M
  - SmolLM3 3B
- Vision-Language Models
  - SmolVLM 500M
  - InternVL3 1B
- Audio-Language Models
  - Granite-speech-3.3-2b
We offer the following modes to execute the model:
- KV Cache Mode: In KV cache mode, the model takes in a single previous token and generates the next predicted token along with its KV cache. It is efficient for generating subsequent tokens after the initial prompt.
- Hybrid Mode: Hybrid mode leverages the strengths of both the AR-N model and KV cache mode to optimize token generation speed. Initially, it uses the AR-N model to efficiently generate the prompt's key-value (KV) cache; it then switches to KV cache mode, which excels at generating subsequent tokens.
  - AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. It is used to process the prompt and generate the key-value (KV) cache, serving as the prompt processor in hybrid mode.
  - Prompt processing with AR-N model: Prompt processing is done using a for-loop. An N-token block is taken, and the KV cache is updated for that block. This process is repeated until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For TTFT (time to first token), the input length (and therefore the number of blocks) varies with the actual prompt length rather than always being the same. A minimal sketch of this loop follows this list.
- Lookahead Mode: Lookahead mode introduces lookahead decoding and uses the AR-N model to process the prompt, enhancing token generation speed. Although decoding multiple tokens in a single step is infeasible, an LLM can generate multiple guess tokens in parallel. These guess tokens may fit into future parts of the generated sequence. The lookahead decoder generates and verifies these guess tokens, integrating them into the sequence when suitable. In some cases it can obtain more than one token in a single step. The result is lossless.
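To make the block-wise prompt processing above concrete, the following is a minimal, framework-independent Python sketch of the AR-N prefill loop. The names prefill_ar_len and run_ar_n_block are illustrative stand-ins and do not correspond to the actual implementation in llama.py.

# Illustrative sketch only: `run_ar_n_block` stands in for one invocation of the
# AR-N graph, which consumes a block of N tokens and updates the KV cache.
def prefill_with_ar_n(prompt_tokens, prefill_ar_len, run_ar_n_block, pad_token_id=0):
    """Consume the prompt in blocks of `prefill_ar_len` tokens (the last block is padded)."""
    logits, kv_cache = None, None
    for start in range(0, len(prompt_tokens), prefill_ar_len):
        block = prompt_tokens[start:start + prefill_ar_len]
        pad_len = prefill_ar_len - len(block)        # only the last block needs padding
        block = block + [pad_token_id] * pad_len     # padded positions are masked inside the model
        logits, kv_cache = run_ar_n_block(block, kv_cache, valid_len=prefill_ar_len - pad_len)
    # The logits at the last valid position seed KV-cache-mode decoding.
    return logits, kv_cache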
We’ve validated this flow on the Samsung Galaxy S23, Samsung Galaxy S24, Samsung Galaxy S25, and OnePlus 12.
Support on other hardware depends on the HTP architecture (HtpArch) and the feature set available on that version.
- LPBQ (16a4w block-wise quantization) requires V69 or newer
- Weight sharing between prefill and decode requires V73 or newer
- 16-bit activations + 16-bit weights for matmul (e.g., 16-bit KV cache) requires V73 or newer
For older HTP versions, you may need to adjust the quantization strategy. Recommended starting points:
- Use 16a4w as the baseline
- Optionally apply SpinQuant
- Use 16a8w selectively on some layers to further improve accuracy (mixed-precision quantization)
If you encounter errors like the following, it typically means the model's requested memory exceeds the 4 GB per-context limit on HTP:
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to find available PD for contextId 1 on deviceId 0 coreId 0 with context size estimate 4025634048
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> context create from binary failed on contextId 1
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Fail to create context from binary with err 1002
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Size Calculation encounter error! Doing Hard reset of reserved mem to 0.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create context from binary with err 0x3ea
[ERROR] [Qnn ExecuTorch]: Can't create context from binary
To resolve this, try increasing the sharding number (num_sharding) to reduce per-shard memory usage.
- For hybrid mode, the export time will be longer and can take 1-4 hours to complete, depending on the model being exported.
- When exporting a hybrid mode model, memory consumption will be higher. Taking LLAMA3.2 1B as an example, please ensure the host machine has at least 80 GB of combined memory and swap space.
- Follow the tutorial to set up ExecuTorch.
- Follow the tutorial to build Qualcomm AI Engine Direct Backend.
- Please install the LLM evaluation dependencies via examples/models/llama/install_requirements.sh
Download and prepare stories110M model
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
# tokenizer.bin:
python -m pytorch_tokenizers.tools.llama2c.convert -t tokenizer.model -o tokenizer.bin
# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.jsonFollow the instructions to download models.
At the end of this step, users should have the following files ready: consolidated.00.pth, params.json, and tokenizer.model.
All example scripts below use hybrid mode, which is optimized for on-device performance. However, compiling a model in hybrid mode can consume a significant amount of memory on the host machine—sometimes up to ~100 GB. If your host machine has limited memory, it is highly recommended to switch from --model_mode hybrid to --model_mode kv and remove the --prefill_ar_len flag.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint stories110M.pt --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --decoder_model stories110m --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "Once upon a time"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using kv mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model codegen2_1b --model_mode kv --max_seq_len 1024 --prompt "def hello_world():"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma2-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma3-1b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model glm-1_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model granite_3_3-2b_instruct --prompt "I would like to learn python, could you teach me with a simple example?" --eval_methods tasks_eval --task hellaswag --limit 10
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model phi_4_mini --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen2_5-0_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --decoder_model qwen2_5-1_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen3-0_6b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --decoder_model qwen3-1_7b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm3-3b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Multimodal models extend LLMs by processing multiple input modalities (vision, audio, text) simultaneously. This framework provides a unified architecture for multimodal models via the Qualcomm HTP backend.
Current Support Status:
- Vision-Language Models (VLM): Fully supported
- Audio-Language Models (ALM): Fully supported
For the general multimodal processing pipeline, please refer to Multimodal Architecture.
Multimodal inference follows these key stages:
1. Modality-Specific Encoding
   - Vision: Images are processed through a vision encoder to generate visual embeddings
   - Audio: Audio waveforms are processed through an audio encoder to generate audio embeddings
   - Text: Text prompts are tokenized and embedded
2. Embedding Fusion
   - All modality embeddings are projected to a common embedding dimension
   - Embeddings are concatenated or fused according to the model's template
   - Special tokens are inserted to mark modality boundaries
3. Unified Language Generation
   - The fused embeddings are fed into the language model decoder
   - The decoder generates text autoregressively using the same execution modes as LLM models (KV Cache, Hybrid, Lookahead)
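As a rough illustration of the fusion stage, the sketch below splices modality embeddings into the text embedding sequence wherever a placeholder token appears. The function name, the tensor shapes, and the placeholder handling are illustrative assumptions only and are not the framework's actual API.

import torch

def fuse_embeddings(text_ids, text_embeds, modality_embeds, placeholder_id):
    """text_ids: [seq], text_embeds: [seq, hidden], modality_embeds: [m_seq, hidden]."""
    fused = []
    for i, tok in enumerate(text_ids.tolist()):
        if tok == placeholder_id:
            fused.append(modality_embeds)        # expand the placeholder into all modality tokens
        else:
            fused.append(text_embeds[i:i + 1])   # keep the text embedding as-is
    return torch.cat(fused, dim=0)               # [fused_seq, hidden]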
Audio-Language Models (ALMs) combine speech/audio processing and natural language processing to understand and generate text based on audio inputs. ALMs in this framework consist of:
- Audio Encoder: Processes raw audio waveforms into audio embeddings (e.g., the CTC encoder for Granite-speech)
- Projector (included in the audio encoder): Aligns audio embeddings with the language model's embedding space.
- Language Decoder: Reuses the static llama decoder to generate text based on the fused audio and text embeddings.
ALM models require the soundfile package for audio loading:
pip install soundfile
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
You can specify a custom audio file for ALM models using the --audio_path flag:
- HTTP/HTTPS URLs: Direct links to audio on the web
  - Example: "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
- HuggingFace repository filenames: Files that exist in the model's HuggingFace repository are automatically downloaded
  - Example: "10226_10111_000000.wav" (auto-downloaded from ibm-granite/granite-speech-3.3-2b)
- Local file paths: Absolute or relative paths to .wav files on your system
  - Example: "/path/to/your/audio.wav"
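The three accepted forms can be thought of as the resolution order sketched below. This is only an illustration of the behavior described above, not the script's actual code; the huggingface_hub dependency and the helper name are assumptions.

import os
import urllib.request

def resolve_audio_path(audio_path, repo_id="ibm-granite/granite-speech-3.3-2b"):
    """Illustrative resolution order for --audio_path (not the script's actual logic)."""
    if audio_path.startswith(("http://", "https://")):
        local_path, _ = urllib.request.urlretrieve(audio_path)    # direct web URL
        return local_path
    if os.path.exists(audio_path):
        return audio_path                                         # local .wav file
    from huggingface_hub import hf_hub_download                   # assumed dependency
    return hf_hub_download(repo_id=repo_id, filename=audio_path)  # file in the model's HF repo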
Default behavior:
If --audio_path is not specified, the system will automatically use the default audio file defined in the model's configuration file (encoder/encoder_config.py).
The audio encoder configuration is defined in encoder/encoder_config.py:
# In encoder/encoder_config.py
@dataclass(init=False, frozen=True)
class GraniteSpeechEncoder(AudioModalityConfig):
    encoder_class = GraniteSpeechCTCEncoderWrapper
    audio_seq_len = 171
    audio_url = "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"  # Default audio (content: "After his nap, ...")
    quant_recipe = GraniteSpeechEncoderQuantRecipe
- audio_seq_len: Number of audio tokens generated by the encoder.
The audio is automatically:
- Loaded from the specified file path or downloaded from HuggingFace
- Read as a waveform using soundfile and converted to a float tensor of shape [1, T]
- Processed by the HuggingFace AutoProcessor to produce mel-filterbank features of shape (1, 844, 160)
- Passed through the CTC encoder and QFormer projector to produce audio embeddings of shape [1, audio_seq_len, hidden_dim]
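For reference, the first two steps can be reproduced in a few lines of Python. This is a minimal sketch assuming the default sample has already been downloaded locally; the AutoProcessor and CTC-encoder steps are handled by the export script and are not reproduced here.

import soundfile as sf
import torch

# Load the (already downloaded) default Granite-speech sample and shape it as the
# [1, T] float tensor described above.
waveform, sample_rate = sf.read("10226_10111_000000.wav", dtype="float32")
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)              # downmix stereo to mono if needed
audio = torch.from_numpy(waveform).unsqueeze(0)   # shape: [1, T]
print(audio.shape, sample_rate)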
If you have already compiled an ALM model, you can run inference with pre-generated PTE files:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
The ALM inference pipeline consists of:
1. Audio Encoding Phase
   - Input audio waveform is loaded and preprocessed into mel-filterbank features: (1, 844, 160)
   - CTC encoder extracts acoustic features using Conformer blocks with block-wise local attention
   - QFormer projector aligns audio embeddings to the language model dimension: [batch, audio_seq_len, hidden_dim]
2. Text Tokenization Phase
   - User prompt is tokenized into text tokens
   - Text tokens are embedded: [batch, text_seq_len, hidden_dim]
3. Embedding Fusion Phase
   - Audio and text embeddings are concatenated according to the model's template
   - The <audio> placeholder in the prompt is expanded to the model-specific special token <|audio|>
   - Final fused sequence: [batch, audio_seq_len + text_seq_len, hidden_dim]
4. Language Generation Phase
   - Fused embeddings are fed into the language decoder
   - Autoregressive generation produces output tokens using KV cache mode
   - KV cache is updated for efficient subsequent token generation
Vision-Language Models (VLMs) combine computer vision and natural language processing to understand and generate text based on visual inputs. VLMs in this framework consist of:
- Vision Encoder: Processes images into visual embeddings (e.g., SigLIP for SmolVLM)
- Projector (included in vision encoder): Aligns visual embeddings with the language model's embedding space
- Language Decoder: Reuses the static llama decoder to generate text based on the fused visual and text embeddings
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode hybrid --prefill_ar_len 16 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model internvl3_1b --model_mode hybrid --prefill_ar_len 32 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg"
You can specify a custom image for VLM models using the --image_path flag:
For example, take an image of the Statue of Liberty in New York Bay.
- HTTP/HTTPS URLs: Direct links to images on the web
  - Example: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
- Local file paths: Absolute or relative paths to image files on your system
Default behavior:
If --image_path is not specified, the system will automatically use the default image URL defined in the model's configuration file (encoder/encoder_config.py).
Each VLM model has specific preprocessing requirements defined in its configuration:
# In encoder/encoder_config.py
@dataclass(init=False, frozen=True)
class SmolVLMEncoder(VisionModalityConfig):
    encoder_class = Idefics3VisionEncoder
    img_seq_len = 64
    img_resized_h = 512
    img_resized_w = 512
    img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"  # Default image
    quant_recipe = SmolVLMEncoderQuantRecipe
- img_resized_h / img_resized_w: Target resolution for the vision encoder
- img_seq_len: Number of visual tokens generated by the encoder
The image is automatically:
- Loaded from the specified URL or file path
- Resized to the model's expected resolution and preprocessed by HuggingFace processors
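As a rough illustration of the resize step, the snippet below fetches the default SmolVLM example image and resizes it to img_resized_w x img_resized_h (512 x 512). It is a sketch only; the actual normalization and patching are performed by the HuggingFace processor, and the requests/PIL usage here is an assumption.

from io import BytesIO

import requests
from PIL import Image

# Fetch the default example image and resize it to the encoder's expected resolution.
url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = Image.open(BytesIO(requests.get(url, timeout=30).content)).convert("RGB")
image = image.resize((512, 512))   # img_resized_w x img_resized_h for SmolVLM
print(image.size)                  # (512, 512)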
If you have already compiled a VLM model, you can run inference with pre-generated PTE files:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
The framework supports multi-turn conversations with VLMs, allowing you to conduct dialogues that can involve multiple images.
- Multi-Turn Prompts: To engage in a conversation, provide multiple prompts sequentially using the --prompt argument. Each string will be treated as a separate turn.
- Multiple Images: You can supply multiple images (from URLs or local paths) using the --image_path argument.
- Flexible Image Placement: Use the <image> token within your prompt to specify exactly where each image's embeddings should be placed. The images provided via --image_path will replace the <image> tokens in the order they appear.
Example:
In this example, the first turn compares two images, the second turn asks a follow-up question about the first image, and the third turn asks for a caption for a third image.
# Define image URLs and prompts for a 3-turn conversation
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."
PROMPT2="Answer the question: What's the main object in first image?"
PROMPT3="<image>Caption this image."
# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
How it works:
- Turn 1: The prompt "<image><image>Compare these images above and list the differences." uses the first two images ($IMAGE1_URL, $IMAGE2_URL).
- Turn 2: The prompt "Answer the question: What's the main object in first image?" is a text-only follow-up. The conversation context is maintained from the previous turn.
- Turn 3: The prompt "<image>Caption this image." uses the third image ($IMAGE3_URL).
The VLM inference pipeline consists of:
1. Vision Encoding Phase
   - Input image is preprocessed (resize, normalize)
   - Vision encoder generates visual embeddings: [batch, img_seq_len, hidden_dim]
   - Visual embeddings are projected to match the language model dimension by the modality projector
2. Text Tokenization Phase
   - User prompt is tokenized into text tokens
   - Text tokens are embedded: [batch, text_seq_len, hidden_dim]
3. Embedding Fusion Phase
   - Visual and text embeddings are concatenated according to the model's template
   - Special tokens (e.g., <image>, <|fake_token_around_image|>, <fake_token_around_image>) mark modality boundaries (see tokenizer.py)

     # Special tokens for Vision-Language Model
     VLM_SPECIAL_TOKENS = {
         "smolvlm_500m_instruct": {
             "image_token": "<image>",
             "global_img": "<global-img>",
             "fake_wrap_start": "<fake_token_around_image>",
             "fake_wrap_end": "<fake_token_around_image>",
         },
         ...
     }

   - Final fused sequence: [batch, img_seq_len + text_seq_len, hidden_dim]
4. Language Generation Phase
   - Fused embeddings are fed into the language decoder
   - Autoregressive generation produces output tokens
   - KV cache is updated for efficient subsequent token generation
We use Smart Mask mechanisms for updating the key-value (KV) cache.
The figure illustrates how key and value caches are updated during each inference step. The Smart Mask mechanism simplifies updating tokens in the cache by modifying only the new token at the designated position. This approach is useful for shared buffers, though it does require copying data in CPU memory to update the KV cache.

| Mechanism | Time Complexity (K) | Time Complexity (V) | Space Complexity (K) | Space Complexity (V) |
|---|---|---|---|---|
| Smart Mask | num_head * head_dim | num_head * head_dim | num_head * seq_len * head_dim | num_head * seq_len * head_dim |
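The complexity figures above follow from the update pattern: the caches are preallocated at full sequence length, and each decode step writes only the new token's slice at its designated position. Below is a minimal, framework-independent sketch of that idea; the shapes and helper name are illustrative, not the runtime's actual data layout.

import torch

num_head, seq_len, head_dim = 8, 128, 64
k_cache = torch.zeros(num_head, seq_len, head_dim)   # preallocated K cache
v_cache = torch.zeros(num_head, seq_len, head_dim)   # preallocated V cache

def smart_mask_update(k_cache, v_cache, new_k, new_v, pos):
    """new_k / new_v: [num_head, head_dim]; pos: position of the token just decoded."""
    k_cache[:, pos, :] = new_k   # per-step cost ~ num_head * head_dim per cache
    v_cache[:, pos, :] = new_v

smart_mask_update(k_cache, v_cache,
                  torch.randn(num_head, head_dim),
                  torch.randn(num_head, head_dim),
                  pos=5)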
If you would like to compile the model only, we have provided the flag --compile_only. Taking LLAMA3.2 as an example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --compile_only
On the other hand, if you already have a pre-compiled .pte model, you can perform inference by providing the flag --pre_gen_pte and specifying the folder that contains the .pte model. Taking LLAMA3.2 as an example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
You can choose the lookahead mode to enhance decoding speed. To use this mode, you need to specify the following parameters:
- --ngram (N-gram size): Represents the size of the n-grams used in the lookahead process.
- --window (window size): Determines how many future tokens the algorithm attempts to predict in each step.
- --gcap (verification candidates): Represents the maximum number of speculations or candidate n-grams that the algorithm considers in each step for verification. It balances the trade-off between computation efficiency and exploring more possibilities.
For more details, please refer to the paper "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding"
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode lookahead --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --ngram 3 --window 2 --gcap 2
This script supports task evaluation and is capable of assessing evaluation scores across 3 phases: prepare_pt2e (CPU FP), convert_pt2e (CPU QDQ), and QNN on device.
To evaluate the perplexity across all 3 phases, users should provide the --eval_methods tasks_eval flag and specify the evaluation task. Please note that when this flag is provided, --prompt ${PROMPT} will be ignored.
For example, using the Qwen model and 1 wikitext sample as the evaluation task, users can assess the perplexity scores of all 3 phases in a single run by including the appropriate configuration:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 1 --verbose
From the example script above, 1 wikitext sample is used to evaluate all 3 phases. However, there are cases where a user may want to use one sample for quantization calibration and multiple samples for perplexity evaluation. In this case, the process should be split into two runs. In the 1st run, the model is compiled using one sample. In the 2nd run, the user can provide a different configuration for QNN device execution. Example:
# 1st run to compile with --limit 1
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 1 --compile_only
# 2nd run to perform QNN device execution with --limit 3
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 3 --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --quant_attrs_path ${PATH_TO_ARTIFACT_IN_1ST_RUN}/kv_llama_qnn_quant_attrs.json
If --tasks ${TASK} is not provided, the program will use --prompt ${PROMPT} as the dataset for quantization calibration.
Regardless of whether --eval_methods tasks_eval is provided, as long as --tasks ${TASK} is specified, the specified tasks will be used for model quantization calibration instead of the prompt.
To evaluate QNN's output logits against the golden logits from nn.Module, users can provide the flag --sqnr_eval. Please note that SQNR evaluation will only compare the logits of the user's prompt and will not compare the new tokens generated by the model.
Example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods sqnr_eval
To automatically identify sensitive layers and generate a mixed-precision recipe suggestion, add the --quant_recipe_suggestion flag. During calibration, the analyzer compares FP32 and QDQ intermediate outputs layer-by-layer using SQNR, then writes two files to the working directory:
- {model_name}_quantization_error.csv — per-group SQNR statistics sorted by sensitivity (most sensitive first)
- {model_name}_suggest_recipe.py — ready-to-use StaticLLMQuantRecipe subclasses optimized to apply higher-precision quantization to the most sensitive groups
Example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen3-1_7b --tasks wikitext --limit 1 --quant_recipe_suggestion --compile_only
After the run, pick one of the generated classes from qwen3-1_7b_suggest_recipe.py as your new recipe. For a full walkthrough, see quantization_guidance.md.
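For reference, the SQNR metric used by both --sqnr_eval and the recipe analyzer is the standard signal-to-quantization-noise ratio. The sketch below shows the usual formula; it is an illustration, not the script's exact implementation.

import torch

def sqnr_db(reference, quantized):
    """SQNR in dB between a reference (FP) tensor and its quantized counterpart; higher is better."""
    signal_power = reference.pow(2).mean()
    noise_power = (reference - quantized).pow(2).mean()
    return 10.0 * torch.log10(signal_power / noise_power)

fp_out = torch.randn(4, 16)
qdq_out = fp_out + 0.01 * torch.randn(4, 16)   # simulated quantization error
print(f"SQNR: {sqnr_db(fp_out, qdq_out).item():.2f} dB")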
Attention sink is a way to evict the cache when the maximum context length is reached. There are two main concepts behind attention sink:
- Maintain Attention Sinks: Always include several initial tokens as attention sinks in the kv cache.
- Redefine Positional Context: Use positions relative to the cache instead of absolute positions from the original text, enhancing relevance and coherence in generated responses.
This feature supports fluent multi-turn conversations and manages long-context scenarios. To enable it, set --use_attention_sink <sink_size>,<batch_eviction_size>.
- --max_seq_len: Maximum sequence length the model can generate
- --max_context_len: Maximum length of the model's memory/cache, including both prompt tokens and generated tokens
- <sink_size>: Always include sink_size initial tokens as attention sinks in the kv cache.
- <batch_eviction_size>: How many tokens to evict from the cache at once when the cache is full.
Example:
# Compile llama pte file and attention sink evictor pte file with sink_size = 4 and batch_eviction_size = 64
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --max_context_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --use_attention_sink 4,64 --compile_only
After running this, the attention_sink_evictor.pte file will be generated in the artifacts directory. This file is necessary for using the attention sink feature, as it handles removing the batch_eviction_size tokens from the kv cache, retaining the first sink_size tokens, and re-rotating the remaining tokens in the kv cache.
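Conceptually, the eviction policy keeps the first sink_size tokens, drops the next batch_eviction_size tokens, and shifts the rest forward. The sketch below illustrates this on a plain token list; it is an assumption-laden simplification, and the real evictor also re-rotates the positional encoding of the retained entries, which is omitted here.

def evict(cache_tokens, max_context_len, sink_size, batch_eviction_size):
    """Illustrative attention-sink eviction on a list of cached token ids."""
    if len(cache_tokens) < max_context_len:
        return cache_tokens                            # still room, nothing to evict
    sinks = cache_tokens[:sink_size]                   # always-kept attention sinks
    rest = cache_tokens[sink_size + batch_eviction_size:]
    return sinks + rest

print(evict(list(range(16)), max_context_len=16, sink_size=4, batch_eviction_size=8))
# -> [0, 1, 2, 3, 12, 13, 14, 15]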
For multi-turn conversations or scenarios with long context using attention sink, you can set max_seq_len higher than the max_context_len used during compilation:
# Run llama with attention sink in multi-turn conversation scenario
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is possible to deploy on website?" --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --use_attention_sink 4,64
If you want to modify sink_size or batch_eviction_size, or if you have a pre-compiled LLM pte file and wish to use the attention sink feature, you can recompile the attention_sink_evictor.pte file with a different attention sink configuration.
# Compile attention sink evictor pte file with sink_size = 4 and batch_eviction_size = 128
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is possible to deploy on website?" --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --use_attention_sink 4,128
Please make sure to use the same --max_context_len, --prefill_ar_len, --model_mode, etc., as those used when compiling the LLM, to ensure the kv cache shape is correct.

