# ZeticMLangeLLMModel

API reference for running LLM inference on Android with ZeticMLangeLLMModel.

This page reflects ZeticMLange Android 1.7.0-beta.1.
ZeticMLangeLLMModel is the Android entry point for on-device LLM inference. The current API has two constructor families:

- Automatic selection by LLMModelMode
- Explicit selection by LLMTarget, LLMQuantType, and APType

Use the automatic constructor first. Use the explicit constructor when you need fixed GGUF selection or processor control.
## Package

com.zeticai.mlange.core.model.llm

## Import

```kotlin
import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel
```

## Constructors

### Automatic Selection (Recommended)
This constructor selects the runtime and quantization automatically from model metadata.
```kotlin
ZeticMLangeLLMModel(
    context: Context,
    personalKey: String,
    name: String,
    version: Int? = null,
    modelMode: LLMModelMode = LLMModelMode.RUN_AUTO,
    dataSetType: LLMDataSetType? = null,
    onProgress: ((Float) -> Unit)? = null,
    onStatusChanged: ((ModelLoadingStatus) -> Unit)? = null,
    cacheHandlingPolicy: ModelCacheHandlingPolicy = ModelCacheHandlingPolicy.REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| context | Context | - | Android context used for cache and file access. |
| personalKey | String | - | Personal key. See Personal Key. |
| name | String | - | Pre-built model key or Hugging Face repository ID. |
| version | Int? | null | Model version. null loads the latest version. |
| modelMode | LLMModelMode | RUN_AUTO | Automatic selection strategy. |
| dataSetType | LLMDataSetType? | null | Optional dataset hint for accuracy-oriented selection. |
| onProgress | ((Float) -> Unit)? | null | Download progress callback from 0.0 to 1.0. |
| onStatusChanged | ((ModelLoadingStatus) -> Unit)? | null | Loading status callback for asset-pack or download state changes. |
| cacheHandlingPolicy | ModelCacheHandlingPolicy | REMOVE_OVERLAPPING | Managed artifact cache policy. |
| initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length. |
Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.
```kotlin
val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    modelMode = LLMModelMode.RUN_AUTO,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
        nCtx = 4096,
    ),
)
```

Automatic selection also accepts initOption. If you need to force GPU or NPU, switch to the explicit constructor below, because apType is not configurable in this path.
### Explicit Runtime Selection
Use this constructor when you want to choose the runtime family, GGUF quantization, and processor type directly.
```kotlin
ZeticMLangeLLMModel(
    context: Context,
    personalKey: String,
    name: String,
    version: Int? = null,
    target: LLMTarget,
    quantType: LLMQuantType,
    apType: APType = APType.CPU,
    onProgress: ((Float) -> Unit)? = null,
    onStatusChanged: ((ModelLoadingStatus) -> Unit)? = null,
    cacheHandlingPolicy: ModelCacheHandlingPolicy = ModelCacheHandlingPolicy.REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| context | Context | - | Android context used for cache and file access. |
| personalKey | String | - | Personal key. See Personal Key. |
| name | String | - | Pre-built model key or Hugging Face repository ID. |
| version | Int? | null | Model version. null loads the latest version. |
| target | LLMTarget | - | Runtime family to load. Use LLMTarget.LLAMA_CPP. |
| quantType | LLMQuantType | - | GGUF quantization to load. |
| apType | APType | CPU | Processor type for the selected runtime. |
| onProgress | ((Float) -> Unit)? | null | Download progress callback from 0.0 to 1.0. |
| onStatusChanged | ((ModelLoadingStatus) -> Unit)? | null | Loading status callback. |
| cacheHandlingPolicy | ModelCacheHandlingPolicy | REMOVE_OVERLAPPING | Managed artifact cache policy. |
| initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length. |
Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.
```kotlin
val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    target = LLMTarget.LLAMA_CPP,
    quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
    apType = APType.GPU,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP,
        nCtx = 4096,
    ),
)
```

## initOption
initOption now contains LLM runtime initialization settings.
```kotlin
data class LLMInitOption(
    val kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
    val nCtx: Int = 2048,
)
```

| Field | Type | Default | Description |
|---|---|---|---|
| kvCacheCleanupPolicy | LLMKVCacheCleanupPolicy | CLEAN_UP_ON_FULL | Conversation KV-cache policy. |
| nCtx | Int | 2048 | Requested context length. |
cacheHandlingPolicy and initOption.kvCacheCleanupPolicy are different settings. cacheHandlingPolicy controls downloaded model artifacts on disk. kvCacheCleanupPolicy controls the in-memory conversation KV cache during generation.
More detailed managed cache behavior is documented as TBD in Cache Management.
nCtx is a requested value, not an exact guarantee. The runtime can normalize it internally depending on the model, backend, or device.
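That normalization can be pictured with a small sketch. The clamp rule below is illustrative only; `resolveContextLength`, `modelMax`, and `backendMax` are hypothetical names, and the SDK's actual internal rule may differ.

```kotlin
// Illustrative only: a runtime may clamp a requested context length to what the
// model and backend actually support. Not the SDK's real normalization logic.
fun resolveContextLength(requested: Int, modelMax: Int, backendMax: Int): Int =
    requested.coerceIn(1, minOf(modelMax, backendMax))
```

Under this sketch, requesting nCtx = 4096 against a backend that caps at 2048 would effectively run with 2048.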
## apType Support
apType is relevant when you use the explicit constructor and choose target = LLMTarget.LLAMA_CPP.
| Device / runtime | Supported apType |
|---|---|
| Qualcomm Android + LLAMA_CPP | CPU, GPU, NPU |
| Other Android devices + LLAMA_CPP | CPU |
For current Android LLaMA.cpp usage, non-Qualcomm devices should use APType.CPU.
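A defensive app can encode that matrix before constructing the model. The helper below is hypothetical (not SDK API); it uses plain strings in place of APType and an `isQualcommSoc` flag whose detection is left to the app.

```kotlin
// Hypothetical fallback helper mirroring the support matrix above:
// GPU/NPU requests are downgraded to CPU on non-Qualcomm devices.
// Strings stand in for APType to keep the sketch self-contained.
fun effectiveApType(requested: String, isQualcommSoc: Boolean): String =
    if (isQualcommSoc || requested == "CPU") requested else "CPU"
```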
## Methods

### run(text)
Starts generation for a prompt.
```kotlin
fun run(text: String): LLMRunResult
```

| Parameter | Type | Description |
|---|---|---|
| text | String | Prompt text to start generation with. |
Returns: LLMRunResult
| Property | Type | Description |
|---|---|---|
| status | Int | Native status code. |
| promptTokens | Int | Number of prompt tokens consumed. |
### waitForNextToken()
Blocks until the next token is available.
```kotlin
fun waitForNextToken(): LLMNextTokenResult
```

Returns: LLMNextTokenResult
| Property | Type | Description |
|---|---|---|
| status | Int | Native status code. |
| token | String | Generated token text. |
| generatedTokens | Int | Number of generated tokens so far. 0 means generation is complete. |
### cleanUp()
Resets the current conversation state without destroying the model instance.
```kotlin
fun cleanUp()
```

If you use LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP, call cleanUp() before starting the next conversation.
### deinit()
Fully releases the underlying target model.
```kotlin
fun deinit()
```

## Multimodal (Beta)
Multimodal embedding injection is Beta in ZeticMLange Android 1.7.0-beta.1. See LLM Inference: Multimodal for design background and the Audio Understanding tutorial for an end-to-end example.
These methods are supported only when the loaded target is the llama.cpp backend (i.e. implements ZeticMLangeMultimodalCapable). They throw with a clear message on other backends.
### runWithEmbeddings(embeddings)
Prefill the decoder with a flat embedding sequence (e.g. audio encoder output, or a chat template assembled by the SDK layer). Positions continue from the current KV-cache length, so this composes with prior run() / runWithEmbeddings() turns.
```kotlin
fun runWithEmbeddings(embeddings: FloatArray)
```

| Parameter | Type | Description |
|---|---|---|
| embeddings | FloatArray | Flat embedding buffer. Length must be a multiple of the model's embedding dimension; the SDK validates and rejects mismatched buffers. |
After this call returns, the embedding batch is queued. Drive token decode + sampling with waitForNextToken() as you would for run(text).
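The length contract can be restated as a small check. `embeddingTokenCount` is a hypothetical re-statement of the validation, not the SDK's code.

```kotlin
// Hypothetical validation sketch: a flat embedding buffer must have a length
// that is an exact multiple of the model's embedding dimension (n_embd), and
// the quotient is the number of positions it will prefill.
fun embeddingTokenCount(embeddings: FloatArray, nEmbd: Int): Int {
    require(nEmbd > 0) { "embedding dimension must be positive" }
    require(embeddings.size % nEmbd == 0) {
        "buffer length ${embeddings.size} is not a multiple of n_embd=$nEmbd"
    }
    return embeddings.size / nEmbd
}
```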
### tokenize(text, parseSpecial)
Tokenize text using the model's vocabulary. With parseSpecial = true, special tokens (e.g. <|audio_bos|>, <|im_start|>) in the input are recognized as single tokens rather than split by BPE.
```kotlin
fun tokenize(text: String, parseSpecial: Boolean): IntArray
```

| Parameter | Type | Description |
|---|---|---|
| text | String | Text to tokenize. |
| parseSpecial | Boolean | When true, recognize special-token literal forms in the input. |
Returns: IntArray of token ids. Empty on failure.
### tokenEmbeddings(tokenIds)
Look up per-token embedding vectors from the model's tok_embd tensor and return them concatenated into a flat [tokenIds.size * n_embd] buffer. Quantized rows are dequantized to float32.
```kotlin
fun tokenEmbeddings(tokenIds: IntArray): FloatArray
```

| Parameter | Type | Description |
|---|---|---|
| tokenIds | IntArray | Token ids to look up. |
Returns: FloatArray of length tokenIds.size * n_embd. Empty on failure.
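Reading one token's vector back out of that flat buffer is a matter of slicing. `tokenVector` below is a hypothetical convenience, not part of the SDK.

```kotlin
// Hypothetical slicing helper for the flat [tokenIds.size * n_embd] buffer
// returned by tokenEmbeddings(): token i occupies indices [i*nEmbd, (i+1)*nEmbd).
fun tokenVector(flat: FloatArray, index: Int, nEmbd: Int): FloatArray =
    flat.copyOfRange(index * nEmbd, (index + 1) * nEmbd)
```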
### specialTokenId(name)
Resolve a special token by its surface form (e.g. "<|audio_bos|>") to its vocabulary id.
```kotlin
fun specialTokenId(name: String): Int
```

| Parameter | Type | Description |
|---|---|---|
| name | String | Special token surface form. |
Returns: Token id, or -1 if the string does not resolve to a single special token in this model's vocab.
## Multimodal Helpers
The SDK ships supporting types in com.zeticai.mlange.core.model.multimodal:
| Symbol | Purpose |
|---|---|
| MultimodalProfile | Declares required special tokens for a multimodal model (e.g. MultimodalProfile.QWEN_OMNI_AUDIO). |
| ZeticMLangeLLMModel.validate(profile) | Init-time check that the loaded model carries every required token. Throws with a clear message naming missing markers. |
| QwenOmniAudioChatTemplate | Builds a flat audio-prompt embedding buffer ready for runWithEmbeddings. |
| ZeticMLangeMultimodalCapable | Capability interface implemented by llama.cpp targets that support these methods. Used internally to gate runWithEmbeddings and friends. |
See the Multimodal page for usage examples.
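For intuition, a chat-template builder ultimately concatenates flat embedding segments (marker-token embeddings, encoder output, trailing prompt-token embeddings) into one buffer for runWithEmbeddings. The sketch below shows only the concatenation step; QwenOmniAudioChatTemplate is the supported path, and `concatEmbeddings` is a hypothetical name.

```kotlin
// Hypothetical concatenation sketch: every segment must already share the same
// n_embd so the combined buffer stays a valid flat embedding sequence.
fun concatEmbeddings(vararg segments: FloatArray): FloatArray {
    val out = FloatArray(segments.sumOf { it.size })
    var offset = 0
    for (s in segments) {
        s.copyInto(out, offset)
        offset += s.size
    }
    return out
}
```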
## Full Examples

### Automatic Selection
```kotlin
import com.zeticai.mlange.core.model.llm.LLMInitOption
import com.zeticai.mlange.core.model.llm.LLMKVCacheCleanupPolicy
import com.zeticai.mlange.core.model.llm.LLMModelMode
import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    modelMode = LLMModelMode.RUN_AUTO,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
        nCtx = 4096,
    ),
)

model.run("Explain on-device AI in one paragraph.")

val sb = StringBuilder()
while (true) {
    val result = model.waitForNextToken()
    if (result.generatedTokens == 0) break
    if (result.token.isNotEmpty()) sb.append(result.token)
}
val output = sb.toString()

model.cleanUp()
model.deinit()
```

### Explicit Qualcomm Selection
```kotlin
import com.zeticai.mlange.core.model.APType
import com.zeticai.mlange.core.model.llm.LLMInitOption
import com.zeticai.mlange.core.model.llm.LLMKVCacheCleanupPolicy
import com.zeticai.mlange.core.model.llm.LLMQuantType
import com.zeticai.mlange.core.model.llm.LLMTarget
import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "changgeun/tiny-llama",
    target = LLMTarget.LLAMA_CPP,
    quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
    apType = APType.NPU,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP,
        nCtx = 4096,
    ),
)
```

## See Also
- ZeticMLangeLLMModel (iOS): iOS equivalent
- LLM Inference Overview: Recommended initialization paths
- Streaming Token Generation: Streaming patterns and cleanup
- Enums and Constants: LLM enums and config types