Video Tutorial: Build Along: Run LLMs Locally on Qualcomm Hardware Using ExecuTorch
This file provides instructions for running LLM decoder models, vision-language models (VLMs), and audio-language models (ALMs) with different parameters via the Qualcomm HTP backend. We currently support the following models:
- Large language models
  - LLAMA2 Stories 110M
  - LLAMA3.2 1B
  - LLAMA3.2 3B
  - Codegen2 1B
  - Gemma 2B
  - Gemma2 2B
  - Gemma3 1B
  - GLM 1.5B
  - Granite3.3 2B
  - Phi4-mini-instruct
  - QWEN2.5 0.5B / 1.5B
  - QWEN3 0.6B / 1.7B
  - SmolLM2 135M
  - SmolLM3 3B
- Vision-Language Models
  - SmolVLM 500M
  - InternVL3 1B
- Audio-Language Models
  - Granite-speech-3.3-2b
We offer the following modes to execute the model:
- KV Cache Mode: In KV cache mode, the model takes in a single previous token and generates the next predicted token along with its KV cache. It is efficient for generating subsequent tokens after the initial prompt.
- Hybrid Mode: Hybrid mode leverages the strengths of both the AR-N model and KV cache mode to optimize token generation speed. Initially, it uses the AR-N model to efficiently generate the prompt's key-value (KV) cache; it then switches to KV cache mode, which excels at generating subsequent tokens.
  - AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. It is used to process the prompt and generate the key-value (KV) cache, serving as the prompt processor in hybrid mode.
  - Prompt processing with AR-N model: Prompt processing is done using a for-loop. An N-token block is taken, and the KV cache is updated for that block. This process is repeated until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For TTFT (time to first token), the input length (and therefore the number of blocks) varies with the actual prompt length rather than always being the same. A minimal sketch of this loop follows this list.
- Lookahead Mode: Lookahead mode introduces lookahead decoding and uses the AR-N model to process the prompt, enhancing token generation speed. Although decoding multiple tokens in a single step is infeasible, an LLM can generate multiple guess tokens in parallel. These guess tokens may fit into future parts of the generated sequence. The lookahead decoder generates and verifies these guess tokens, integrating them into the sequence when suitable. In some cases it can obtain more than one token in a single step. The result is lossless.
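To make the block-wise prompt processing above concrete, the following is a minimal, framework-independent Python sketch of the AR-N prefill loop. The names prefill_ar_len and run_ar_n_block are illustrative stand-ins and do not correspond to the actual implementation in llama.py.

# Illustrative sketch only: `run_ar_n_block` stands in for one invocation of the
# AR-N graph, which consumes a block of N tokens and updates the KV cache.
def prefill_with_ar_n(prompt_tokens, prefill_ar_len, run_ar_n_block, pad_token_id=0):
    """Consume the prompt in blocks of `prefill_ar_len` tokens (the last block is padded)."""
    logits, kv_cache = None, None
    for start in range(0, len(prompt_tokens), prefill_ar_len):
        block = prompt_tokens[start:start + prefill_ar_len]
        pad_len = prefill_ar_len - len(block)        # only the last block needs padding
        block = block + [pad_token_id] * pad_len     # padded positions are masked inside the model
        logits, kv_cache = run_ar_n_block(block, kv_cache, valid_len=prefill_ar_len - pad_len)
    # The logits at the last valid position seed KV-cache-mode decoding.
    return logits, kv_cache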
We’ve validated this flow on the Samsung Galaxy S23, Samsung Galaxy S24, Samsung Galaxy S25, and OnePlus 12.
Support on other hardware depends on the HTP architecture (HtpArch) and the feature set available on that version.
- LPBQ (16a4w block-wise quantization) requires V69 or newer
- Weight sharing between prefill and decode requires V73 or newer
- 16-bit activations + 16-bit weights for matmul (e.g., 16-bit KV cache) requires V73 or newer
For older HTP versions, you may need to adjust the quantization strategy. Recommended starting points:
- Use 16a4w as the baseline
- Optionally apply SpinQuant
- Use 16a8w selectively on some layers to further improve accuracy (mixed-precision quantization)
If you encounter errors like the following, it typically means the model's requested memory exceeds the 4 GB per-context limit on HTP:
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to find available PD for contextId 1 on deviceId 0 coreId 0 with context size estimate 4025634048
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> context create from binary failed on contextId 1
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Fail to create context from binary with err 1002
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Size Calculation encounter error! Doing Hard reset of reserved mem to 0.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create context from binary with err 0x3ea
[ERROR] [Qnn ExecuTorch]: Can't create context from binary
To resolve this, try increasing the sharding number (num_sharding) to reduce per-shard memory usage.
- For hybrid mode, the export time will be longer and can take 1-4 hours to complete, depending on the model being exported.
- When exporting a hybrid mode model, memory consumption will be higher. Taking LLAMA3.2 1B as an example, please ensure the host machine has at least 80 GB of combined memory and swap space.
- Follow the tutorial to set up ExecuTorch.
- Follow the tutorial to build Qualcomm AI Engine Direct Backend.
- Please install the LLM evaluation dependencies via examples/models/llama/install_requirements.sh
Download and prepare stories110M model
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
# tokenizer.bin:
python -m pytorch_tokenizers.tools.llama2c.convert -t tokenizer.model -o tokenizer.bin
# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.jsonFollow the instructions to download models.
At the end of this step, users should have the following files ready: consolidated.00.pth, params.json, and tokenizer.model.
All example scripts below use hybrid mode, which is optimized for on-device performance. However, compiling a model in hybrid mode can consume a significant amount of memory on the host machine—sometimes up to ~100 GB. If your host machine has limited memory, it is highly recommended to switch from --model_mode hybrid to --model_mode kv and remove the --prefill_ar_len flag.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint stories110M.pt --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --decoder_model stories110m --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "Once upon a time"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using kv mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model codegen2_1b --model_mode kv --max_seq_len 1024 --prompt "def hello_world():"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma2-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma3-1b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model glm-1_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model granite_3_3-2b_instruct --prompt "I would like to learn python, could you teach me with a simple example?" --eval_methods tasks_eval --task hellaswag --limit 10
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model phi_4_mini --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen2_5-0_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --decoder_model qwen2_5-1_5b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model qwen3-0_6b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --decoder_model qwen3-1_7b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smollm3-3b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
Multimodal models extend LLMs by processing multiple input modalities (vision, audio, text) simultaneously. This framework provides a unified architecture for multimodal models via the Qualcomm HTP backend.
Current Support Status:
- Vision-Language Models (VLM): Fully supported
- Audio-Language Models (ALM): Fully supported
For the general multimodal processing pipeline, please refer to Multimodal Architecture.
Multimodal inference follows these key stages:
1. Modality-Specific Encoding
   - Vision: Images are processed through a vision encoder to generate visual embeddings
   - Audio: Audio waveforms are processed through an audio encoder to generate audio embeddings
   - Text: Text prompts are tokenized and embedded
2. Embedding Fusion
   - All modality embeddings are projected to a common embedding dimension
   - Embeddings are concatenated or fused according to the model's template
   - Special tokens are inserted to mark modality boundaries
3. Unified Language Generation
   - The fused embeddings are fed into the language model decoder
   - The decoder generates text autoregressively using the same execution modes as LLM models (KV Cache, Hybrid, Lookahead)
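As a rough illustration of the fusion stage, the sketch below splices modality embeddings into the text embedding sequence wherever a placeholder token appears. The function name, the tensor shapes, and the placeholder handling are illustrative assumptions only and are not the framework's actual API.

import torch

def fuse_embeddings(text_ids, text_embeds, modality_embeds, placeholder_id):
    """text_ids: [seq], text_embeds: [seq, hidden], modality_embeds: [m_seq, hidden]."""
    fused = []
    for i, tok in enumerate(text_ids.tolist()):
        if tok == placeholder_id:
            fused.append(modality_embeds)        # expand the placeholder into all modality tokens
        else:
            fused.append(text_embeds[i:i + 1])   # keep the text embedding as-is
    return torch.cat(fused, dim=0)               # [fused_seq, hidden]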
Audio-Language Models (ALMs) combine speech/audio processing and natural language processing to understand and generate text based on audio inputs. ALMs in this framework consist of:
- Audio Encoder: Processes raw audio waveforms into audio embeddings (e.g., the CTC encoder for Granite-speech)
- Projector (included in the audio encoder): Aligns audio embeddings with the language model's embedding space.
- Language Decoder: Reuses the static llama decoder to generate text based on the fused audio and text embeddings.
ALM models require the soundfile package for audio loading:
pip install soundfile
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
You can specify a custom audio file for ALM models using the --audio_path flag:
- HTTP/HTTPS URLs: Direct links to audio on the web
  - Example: "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
- HuggingFace repository filenames: Files that exist in the model's HuggingFace repository are automatically downloaded
  - Example: "10226_10111_000000.wav" (auto-downloaded from ibm-granite/granite-speech-3.3-2b)
- Local file paths: Absolute or relative paths to .wav files on your system
  - Example: "/path/to/your/audio.wav"
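The three accepted forms can be thought of as the resolution order sketched below. This is only an illustration of the behavior described above, not the script's actual code; the huggingface_hub dependency and the helper name are assumptions.

import os
import urllib.request

def resolve_audio_path(audio_path, repo_id="ibm-granite/granite-speech-3.3-2b"):
    """Illustrative resolution order for --audio_path (not the script's actual logic)."""
    if audio_path.startswith(("http://", "https://")):
        local_path, _ = urllib.request.urlretrieve(audio_path)    # direct web URL
        return local_path
    if os.path.exists(audio_path):
        return audio_path                                         # local .wav file
    from huggingface_hub import hf_hub_download                   # assumed dependency
    return hf_hub_download(repo_id=repo_id, filename=audio_path)  # file in the model's HF repo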
Default behavior:
If --audio_path is not specified, the system will automatically use the default audio file defined in the model's configuration file (encoder/encoder_config.py).
The audio encoder configuration is defined in encoder/encoder_config.py:
# In encoder/encoder_config.py
@dataclass(init=False, frozen=True)
class GraniteSpeechEncoder(AudioModalityConfig):
    encoder_class = GraniteSpeechCTCEncoderWrapper
    audio_seq_len = 171
    audio_url = "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"  # Default audio (content: "After his nap, ...")
    quant_recipe = GraniteSpeechEncoderQuantRecipe
- audio_seq_len: Number of audio tokens generated by the encoder.
The audio is automatically:
- Loaded from the specified file path or downloaded from HuggingFace
- Read as a waveform using soundfile and converted to a float tensor of shape [1, T]
- Processed by the HuggingFace AutoProcessor to produce mel-filterbank features of shape (1, 844, 160)
- Passed through the CTC encoder and QFormer projector to produce audio embeddings of shape [1, audio_seq_len, hidden_dim]
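For reference, the first two steps can be reproduced in a few lines of Python. This is a minimal sketch assuming the default sample has already been downloaded locally; the AutoProcessor and CTC-encoder steps are handled by the export script and are not reproduced here.

import soundfile as sf
import torch

# Load the (already downloaded) default Granite-speech sample and shape it as the
# [1, T] float tensor described above.
waveform, sample_rate = sf.read("10226_10111_000000.wav", dtype="float32")
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)              # downmix stereo to mono if needed
audio = torch.from_numpy(waveform).unsqueeze(0)   # shape: [1, T]
print(audio.shape, sample_rate)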
If you have already compiled an ALM model, you can run inference with pre-generated PTE files:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
The ALM inference pipeline consists of:
1. Audio Encoding Phase
   - Input audio waveform is loaded and preprocessed into mel-filterbank features: (1, 844, 160)
   - CTC encoder extracts acoustic features using Conformer blocks with block-wise local attention
   - QFormer projector aligns audio embeddings to the language model dimension: [batch, audio_seq_len, hidden_dim]
2. Text Tokenization Phase
   - User prompt is tokenized into text tokens
   - Text tokens are embedded: [batch, text_seq_len, hidden_dim]
3. Embedding Fusion Phase
   - Audio and text embeddings are concatenated according to the model's template
   - The <audio> placeholder in the prompt is expanded to the model-specific special token <|audio|>
   - Final fused sequence: [batch, audio_seq_len + text_seq_len, hidden_dim]
4. Language Generation Phase
   - Fused embeddings are fed into the language decoder
   - Autoregressive generation produces output tokens using KV cache mode
   - KV cache is updated for efficient subsequent token generation
Vision-Language Models (VLMs) combine computer vision and natural language processing to understand and generate text based on visual inputs. VLMs in this framework consist of:
- Vision Encoder: Processes images into visual embeddings (e.g., SigLIP for SmolVLM)
- Projector (included in vision encoder): Aligns visual embeddings with the language model's embedding space
- Language Decoder: Reuses the static llama decoder to generate text based on the fused visual and text embeddings
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode hybrid --prefill_ar_len 16 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
Default example using hybrid mode.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model internvl3_1b --model_mode hybrid --prefill_ar_len 32 --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg"
You can specify a custom image for VLM models using the --image_path flag:
For example, take an image of the Statue of Liberty in New York Bay.
- HTTP/HTTPS URLs: Direct links to images on the web
  - Example: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
- Local file paths: Absolute or relative paths to image files on your system
Default behavior:
If --image_path is not specified, the system will automatically use the default image URL defined in the model's configuration file (encoder/encoder_config.py).
Each VLM model has specific preprocessing requirements defined in its configuration:
# In encoder/encoder_config.py
@dataclass(init=False, frozen=True)
class SmolVLMEncoder(VisionModalityConfig):
    encoder_class = Idefics3VisionEncoder
    img_seq_len = 64
    img_resized_h = 512
    img_resized_w = 512
    img_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"  # Default image
    quant_recipe = SmolVLMEncoderQuantRecipe
- img_resized_h / img_resized_w: Target resolution for the vision encoder
- img_seq_len: Number of visual tokens generated by the encoder
The image is automatically:
- Loaded from the specified URL or file path
- Resized to the model's expected resolution and preprocessed by HuggingFace processors
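As a rough illustration of the resize step, the snippet below fetches the default SmolVLM example image and resizes it to img_resized_w x img_resized_h (512 x 512). It is a sketch only; the actual normalization and patching are performed by the HuggingFace processor, and the requests/PIL usage here is an assumption.

from io import BytesIO

import requests
from PIL import Image

# Fetch the default example image and resize it to the encoder's expected resolution.
url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = Image.open(BytesIO(requests.get(url, timeout=30).content)).convert("RGB")
image = image.resize((512, 512))   # img_resized_w x img_resized_h for SmolVLM
print(image.size)                  # (512, 512)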
If you have already compiled a VLM model, you can run inference with pre-generated PTE files:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
The framework supports multi-turn conversations with VLMs, allowing you to conduct dialogues that can involve multiple images.
- Multi-Turn Prompts: To engage in a conversation, provide multiple prompts sequentially using the --prompt argument. Each string will be treated as a separate turn.
- Multiple Images: You can supply multiple images (from URLs or local paths) using the --image_path argument.
- Flexible Image Placement: Use the <image> token within your prompt to specify exactly where each image's embeddings should be placed. The images provided via --image_path will replace the <image> tokens in the order they appear.
Example:
In this example, the first turn compares two images, the second turn asks a follow-up question about the first image, and the third turn asks for a caption for a third image.
# Define image URLs and prompts for a 3-turn conversation
IMAGE1_URL="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
IMAGE2_URL="http://images.cocodataset.org/val2017/000000039769.jpg"
IMAGE3_URL="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
PROMPT1="<image><image>Compare these images above and list the differences."
PROMPT2="Answer the question: What's the main object in first image?"
PROMPT3="<image>Caption this image."
# Execute the multi-turn conversation
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 2048 --prompt "$PROMPT1" "$PROMPT2" "$PROMPT3" --image_path "$IMAGE1_URL" "$IMAGE2_URL" "$IMAGE3_URL"
How it works:
- Turn 1: The prompt "<image><image>Compare these images above and list the differences." uses the first two images ($IMAGE1_URL, $IMAGE2_URL).
- Turn 2: The prompt "Answer the question: What's the main object in first image?" is a text-only follow-up. The conversation context is maintained from the previous turn.
- Turn 3: The prompt "<image>Caption this image." uses the third image ($IMAGE3_URL).
The VLM inference pipeline consists of:
1. Vision Encoding Phase
   - Input image is preprocessed (resize, normalize)
   - Vision encoder generates visual embeddings: [batch, img_seq_len, hidden_dim]
   - Visual embeddings are projected to match the language model dimension by the modality projector
2. Text Tokenization Phase
   - User prompt is tokenized into text tokens
   - Text tokens are embedded: [batch, text_seq_len, hidden_dim]
3. Embedding Fusion Phase
   - Visual and text embeddings are concatenated according to the model's template
   - Special tokens (e.g., <image>, <|fake_token_around_image|>, <fake_token_around_image>) mark modality boundaries (see tokenizer.py)

     # Special tokens for Vision-Language Model
     VLM_SPECIAL_TOKENS = {
         "smolvlm_500m_instruct": {
             "image_token": "<image>",
             "global_img": "<global-img>",
             "fake_wrap_start": "<fake_token_around_image>",
             "fake_wrap_end": "<fake_token_around_image>",
         },
         ...
     }

   - Final fused sequence: [batch, img_seq_len + text_seq_len, hidden_dim]
4. Language Generation Phase
   - Fused embeddings are fed into the language decoder
   - Autoregressive generation produces output tokens
   - KV cache is updated for efficient subsequent token generation
We use Smart Mask mechanisms for updating the key-value (KV) cache.
The figure illustrates how key and value caches are updated during each inference step. The Smart Mask mechanism simplifies updating tokens in the cache by modifying only the new token at the designated position. This approach is useful for shared buffers, though it does require copying data in CPU memory to update the KV cache.

| Mechanism | Time Complexity (K) | Time Complexity (V) | Space Complexity (K) | Space Complexity (V) |
|---|---|---|---|---|
| Smart Mask | num_head * head_dim | num_head * head_dim | num_head * seq_len * head_dim | num_head * seq_len * head_dim |
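The complexity figures above follow from the update pattern: the caches are preallocated at full sequence length, and each decode step writes only the new token's slice at its designated position. Below is a minimal, framework-independent sketch of that idea; the shapes and helper name are illustrative, not the runtime's actual data layout.

import torch

num_head, seq_len, head_dim = 8, 128, 64
k_cache = torch.zeros(num_head, seq_len, head_dim)   # preallocated K cache
v_cache = torch.zeros(num_head, seq_len, head_dim)   # preallocated V cache

def smart_mask_update(k_cache, v_cache, new_k, new_v, pos):
    """new_k / new_v: [num_head, head_dim]; pos: position of the token just decoded."""
    k_cache[:, pos, :] = new_k   # per-step cost ~ num_head * head_dim per cache
    v_cache[:, pos, :] = new_v

smart_mask_update(k_cache, v_cache,
                  torch.randn(num_head, head_dim),
                  torch.randn(num_head, head_dim),
                  pos=5)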
If you would like to compile the model only, we have provided the flag --compile_only. Taking LLAMA3.2 as an example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --compile_only
On the other hand, if you already have a pre-compiled .pte model, you can perform inference by providing the flag --pre_gen_pte and specifying the folder that contains the .pte model. Taking LLAMA3.2 as an example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
You can choose the lookahead mode to enhance decoding speed. To use this mode, you need to specify the following parameters:
- --ngram (N-gram size): Represents the size of the n-grams used in the lookahead process.
- --window (window size): Determines how many future tokens the algorithm attempts to predict in each step.
- --gcap (verification candidates): Represents the maximum number of speculations or candidate n-grams that the algorithm considers in each step for verification. It balances the trade-off between computation efficiency and exploring more possibilities.
For more details, please refer to the paper "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding"
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2 --model_mode lookahead --prefill_ar_len 32 --max_seq_len 128 --prompt "I would like to learn python, could you teach me with a simple example?" --ngram 3 --window 2 --gcap 2
This script supports task evaluation and is capable of assessing evaluation scores across 3 phases: prepare_pt2e (CPU FP), convert_pt2e (CPU QDQ), and QNN on device.
To evaluate the perplexity across all 3 phases, users should provide the --eval_methods tasks_eval flag and specify the evaluation task. Please note that when this flag is provided, --prompt ${PROMPT} will be ignored.
For example, using the Qwen model and 1 wikitext sample as the evaluation task, users can assess the perplexity scores of all 3 phases in a single run by including the appropriate configuration:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 1 --verbose
From the example script above, 1 wikitext sample is used to evaluate all 3 phases. However, there are cases where a user may want to use one sample for quantization calibration and multiple samples for perplexity evaluation. In this case, the process should be split into two runs. In the 1st run, the model is compiled using one sample. In the 2nd run, the user can provide a different configuration for QNN device execution. Example:
# 1st run to compile with --limit 1
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 1 --compile_only
# 2nd run to perform QNN device execution with --limit 3
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods tasks_eval --tasks wikitext --limit 3 --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --quant_attrs_path ${PATH_TO_ARTIFACT_IN_1ST_RUN}/kv_llama_qnn_quant_attrs.json
If --tasks ${TASK} is not provided, the program will use --prompt ${PROMPT} as the dataset for quantization calibration.
Regardless of whether --eval_methods tasks_eval is provided, as long as --tasks ${TASK} is specified, the specified tasks will be used for model quantization calibration instead of the prompt.
To evaluate QNN's output logits against the golden logits from nn.Module, users can provide the flag --sqnr_eval. Please note that SQNR evaluation will only compare the logits of the user's prompt and will not compare the new tokens generated by the model.
Example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen2_5-0_5b --eval_methods sqnr_eval
To automatically identify sensitive layers and generate a mixed-precision recipe suggestion, add the --quant_recipe_suggestion flag. During calibration, the analyzer compares FP32 and QDQ intermediate outputs layer-by-layer using SQNR, then writes two files to the working directory:
- {model_name}_quantization_error.csv — per-group SQNR statistics sorted by sensitivity (most sensitive first)
- {model_name}_suggest_recipe.py — ready-to-use StaticLLMQuantRecipe subclasses optimized to apply higher-precision quantization to the most sensitive groups
Example:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "I would like to learn python, could you teach me with a simple example?" --temperature 0 --model_mode kv --max_seq_len 1024 --decoder_model qwen3-1_7b --tasks wikitext --limit 1 --quant_recipe_suggestion --compile_only
After the run, pick one of the generated classes from qwen3-1_7b_suggest_recipe.py as your new recipe. For a full walkthrough, see quantization_guidance.md.
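For reference, the SQNR metric used by both --sqnr_eval and the recipe analyzer is the standard signal-to-quantization-noise ratio. The sketch below shows the usual formula; it is an illustration, not the script's exact implementation.

import torch

def sqnr_db(reference, quantized):
    """SQNR in dB between a reference (FP) tensor and its quantized counterpart; higher is better."""
    signal_power = reference.pow(2).mean()
    noise_power = (reference - quantized).pow(2).mean()
    return 10.0 * torch.log10(signal_power / noise_power)

fp_out = torch.randn(4, 16)
qdq_out = fp_out + 0.01 * torch.randn(4, 16)   # simulated quantization error
print(f"SQNR: {sqnr_db(fp_out, qdq_out).item():.2f} dB")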
Attention sink is a way to evict the cache when the maximum context length is reached. There are two main concepts behind attention sink:
- Maintain Attention Sinks: Always include several initial tokens as attention sinks in the kv cache.
- Redefine Positional Context: Use positions relative to the cache instead of absolute positions from the original text, enhancing relevance and coherence in generated responses.
This feature supports fluent multi-turn conversations and manages long-context scenarios. To enable it, set --use_attention_sink <sink_size>,<batch_eviction_size>.
- --max_seq_len: Maximum sequence length the model can generate
- --max_context_len: Maximum length of the model's memory/cache, including both prompt tokens and generated tokens
- <sink_size>: Always include sink_size initial tokens as attention sinks in the kv cache.
- <batch_eviction_size>: How many tokens to evict from the cache at once when the cache is full.
Example:
# Compile llama pte file and attention sink evictor pte file with sink_size = 4 and batch_eviction_size = 64
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --max_context_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --use_attention_sink 4,64 --compile_only
After running this, the attention_sink_evictor.pte file will be generated in the artifacts directory. This file is necessary for using the attention sink feature, as it handles removing the batch_eviction_size tokens from the kv cache, retaining the first sink_size tokens, and re-rotating the remaining tokens in the kv cache.
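Conceptually, the eviction policy keeps the first sink_size tokens, drops the next batch_eviction_size tokens, and shifts the rest forward. The sketch below illustrates this on a plain token list; it is an assumption-laden simplification, and the real evictor also re-rotates the positional encoding of the retained entries, which is omitted here.

def evict(cache_tokens, max_context_len, sink_size, batch_eviction_size):
    """Illustrative attention-sink eviction on a list of cached token ids."""
    if len(cache_tokens) < max_context_len:
        return cache_tokens                            # still room, nothing to evict
    sinks = cache_tokens[:sink_size]                   # always-kept attention sinks
    rest = cache_tokens[sink_size + batch_eviction_size:]
    return sinks + rest

print(evict(list(range(16)), max_context_len=16, sink_size=4, batch_eviction_size=8))
# -> [0, 1, 2, 3, 12, 13, 14, 15]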
For multi-turn conversations or scenarios with long context using attention sink, you can set max_seq_len higher than the max_context_len used during compilation:
# Run llama with attention sink in multi-turn conversation scenario
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is possible to deploy on website?" --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --use_attention_sink 4,64
If you want to modify sink_size or batch_eviction_size, or if you have a pre-compiled LLM pte file and wish to use the attention sink feature, you can recompile the attention_sink_evictor.pte file with a different attention sink configuration.
# Compile attention sink evictor pte file with sink_size = 4 and batch_eviction_size = 128
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 4096 --prompt "I would like to learn python, could you teach me with a simple example?" "Could you give more difficult example in python?" "Could you add a GUI for this game?" "Could you tell me more about tkinter?" "Is possible to deploy on website?" --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --use_attention_sink 4,128
Please make sure to use the same --max_context_len, --prefill_ar_len, --model_mode, etc., as those used when compiling the LLM, to ensure the kv cache shape is correct.

