
ZeticMLangeLLMModel

API reference for running LLM inference on Android with ZeticMLangeLLMModel.

This page reflects ZeticMLange Android 1.5.9.

ZeticMLangeLLMModel is the Android entry point for on-device LLM inference. The current API has two constructor families:

  • Automatic selection by LLMModelMode
  • Explicit selection by LLMTarget, LLMQuantType, and APType

Use the automatic constructor first. Use the explicit constructor when you need fixed GGUF selection or processor control.

Package

com.zeticai.mlange.core.model.llm

Import

import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel

Constructors

Automatic Selection

This constructor selects the runtime and quantization automatically from model metadata.

ZeticMLangeLLMModel(
    context: Context,
    personalKey: String,
    name: String,
    version: Int? = null,
    modelMode: LLMModelMode = LLMModelMode.RUN_AUTO,
    dataSetType: LLMDataSetType? = null,
    onProgress: ((Float) -> Unit)? = null,
    onStatusChanged: ((ModelLoadingStatus) -> Unit)? = null,
    cacheHandlingPolicy: ModelCacheHandlingPolicy = ModelCacheHandlingPolicy.REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
)
Parameter | Type | Default | Description
context | Context | - | Android context used for cache and file access.
personalKey | String | - | Personal key from the Melange Dashboard.
name | String | - | Pre-built model key or Hugging Face repository ID.
version | Int? | null | Model version. null loads the latest version.
modelMode | LLMModelMode | RUN_AUTO | Automatic selection strategy.
dataSetType | LLMDataSetType? | null | Optional dataset hint for accuracy-oriented selection.
onProgress | ((Float) -> Unit)? | null | Download progress callback from 0.0 to 1.0.
onStatusChanged | ((ModelLoadingStatus) -> Unit)? | null | Loading status callback for asset-pack or download state changes.
cacheHandlingPolicy | ModelCacheHandlingPolicy | REMOVE_OVERLAPPING | Managed artifact cache policy.
initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length.

Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    modelMode = LLMModelMode.RUN_AUTO,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
        nCtx = 4096,
    ),
)

The automatic constructor also accepts initOption, but apType is not configurable in this path. If you need to force GPU or NPU, use the explicit constructor below.

Explicit Runtime Selection

Use this constructor when you want to choose the runtime family, GGUF quantization, and processor type directly.

ZeticMLangeLLMModel(
    context: Context,
    personalKey: String,
    name: String,
    version: Int? = null,
    target: LLMTarget,
    quantType: LLMQuantType,
    apType: APType = APType.CPU,
    onProgress: ((Float) -> Unit)? = null,
    onStatusChanged: ((ModelLoadingStatus) -> Unit)? = null,
    cacheHandlingPolicy: ModelCacheHandlingPolicy = ModelCacheHandlingPolicy.REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
)
Parameter | Type | Default | Description
context | Context | - | Android context used for cache and file access.
personalKey | String | - | Personal key from the Melange Dashboard.
name | String | - | Pre-built model key or Hugging Face repository ID.
version | Int? | null | Model version. null loads the latest version.
target | LLMTarget | - | Runtime family to load. Use LLMTarget.LLAMA_CPP.
quantType | LLMQuantType | - | GGUF quantization to load.
apType | APType | CPU | Processor type for the selected runtime.
onProgress | ((Float) -> Unit)? | null | Download progress callback from 0.0 to 1.0.
onStatusChanged | ((ModelLoadingStatus) -> Unit)? | null | Loading status callback.
cacheHandlingPolicy | ModelCacheHandlingPolicy | REMOVE_OVERLAPPING | Managed artifact cache policy.
initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length.

Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    target = LLMTarget.LLAMA_CPP,
    quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
    apType = APType.GPU,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP,
        nCtx = 4096,
    ),
)

initOption

initOption carries the LLM runtime initialization settings.

data class LLMInitOption(
    val kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
    val nCtx: Int = 2048,
)
Field | Type | Default | Description
kvCacheCleanupPolicy | LLMKVCacheCleanupPolicy | CLEAN_UP_ON_FULL | Conversation KV-cache policy.
nCtx | Int | 2048 | Requested context length.

cacheHandlingPolicy and initOption.kvCacheCleanupPolicy are different settings. cacheHandlingPolicy controls downloaded model artifacts on disk. kvCacheCleanupPolicy controls the in-memory conversation KV cache during generation.

More detailed managed cache behavior is documented as TBD in Cache Management.

nCtx is a requested value, not an exact guarantee. The runtime can normalize it internally depending on the model, backend, or device.

apType Support

apType is relevant when you use the explicit constructor and choose target = LLMTarget.LLAMA_CPP.

Device / runtime | Supported apType
Qualcomm Android + LLAMA_CPP | CPU, GPU, NPU
Other Android devices + LLAMA_CPP | CPU

For current Android LLaMA.cpp usage, non-Qualcomm devices should use APType.CPU.
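The fallback rule in the table above can be expressed as a small selection helper. A minimal sketch, assuming Qualcomm hardware is detected from the SoC manufacturer string (chooseApTypeName and the detection logic are illustrative, not part of the SDK):

```kotlin
// Illustrative helper: map device capability to an apType name per the
// support table above (Qualcomm + LLAMA_CPP: CPU/GPU/NPU; others: CPU only).
fun chooseApTypeName(socManufacturer: String, preferred: String): String {
    val isQualcomm = socManufacturer.contains("qualcomm", ignoreCase = true)
    // Non-Qualcomm devices fall back to CPU regardless of preference.
    return if (isQualcomm) preferred else "CPU"
}
```

On a device you would resolve the result to the SDK enum, for example APType.valueOf(chooseApTypeName(Build.SOC_MANUFACTURER, "NPU")) (Build.SOC_MANUFACTURER requires API 31+).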

Methods

run(text)

Starts generation for a prompt.

fun run(text: String): LLMRunResult
Parameter | Type | Description
text | String | Prompt text to start generation with.

Returns: LLMRunResult

Property | Type | Description
status | Int | Native status code.
promptTokens | Int | Number of prompt tokens consumed.

waitForNextToken()

Blocks until the next token is available.

fun waitForNextToken(): LLMNextTokenResult

Returns: LLMNextTokenResult

Property | Type | Description
status | Int | Native status code.
token | String | Generated token text.
generatedTokens | Int | Number of generated tokens so far. 0 means generation is complete.
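Together, run and waitForNextToken form a simple streaming loop. A minimal sketch that forwards each token to a caller-supplied callback (streamCompletion and onToken are illustrative names, not SDK API):

```kotlin
// Stream tokens to a callback until generatedTokens reports 0 (generation done).
fun streamCompletion(
    model: ZeticMLangeLLMModel,
    prompt: String,
    onToken: (String) -> Unit,
): String {
    model.run(prompt)
    val sb = StringBuilder()
    while (true) {
        val result = model.waitForNextToken()
        if (result.generatedTokens == 0) break // 0 signals completion
        if (result.token.isNotEmpty()) {
            sb.append(result.token)
            onToken(result.token)
        }
    }
    return sb.toString()
}
```

Because waitForNextToken blocks, call this off the main thread (for example from a coroutine on Dispatchers.Default).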

cleanUp()

Resets the current conversation state without destroying the model instance.

fun cleanUp()

If you use LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP, the conversation KV cache persists across run calls, so call cleanUp() before starting the next conversation.
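With DO_NOT_CLEAN_UP, a multi-conversation flow looks like this sketch (generation loops elided; prompts are illustrative):

```kotlin
// First conversation: consecutive run calls share the in-memory KV cache.
model.run("My name is Dana.")
// ... drain tokens with waitForNextToken() ...
model.run("What is my name?") // sees the earlier turn via the KV cache
// ... drain tokens ...

// Start a fresh conversation: reset conversation state without reloading the model.
model.cleanUp()
model.run("New topic: explain quantization briefly.")
```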

deinit()

Fully releases the underlying target model.

fun deinit()

Full Examples

Automatic Selection

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "google/gemma-3-4b-it",
    modelMode = LLMModelMode.RUN_AUTO,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
        nCtx = 4096,
    ),
)

model.run("Explain on-device AI in one paragraph.")

val sb = StringBuilder()
while (true) {
    val result = model.waitForNextToken()
    if (result.generatedTokens == 0) break
    if (result.token.isNotEmpty()) sb.append(result.token)
}

val output = sb.toString()
model.cleanUp()
model.deinit()

Explicit Qualcomm Selection

val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = "changgeun/tiny-llama",
    target = LLMTarget.LLAMA_CPP,
    quantType = LLMQuantType.GGUF_QUANT_Q4_K_M,
    apType = APType.NPU,
    initOption = LLMInitOption(
        kvCacheCleanupPolicy = LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP,
        nCtx = 4096,
    ),
)
