LLM Inference Overview

Examples on this page reflect ZeticMLange Android 1.9.0, ZeticMLange iOS 1.9.0, and zetic_mlange 1.9.1.

ZeticMLangeLLMModel runs text generation on-device and streams tokens through waitForNextToken(). In 1.9.x, the public LLM surface also includes function calling, composition-based RAG, vision-language image responses, and KV state persistence on native Android and iOS.

Load A Model

Use a model name from the Melange Dashboard or a supported public Hugging Face model name.

// Android (Kotlin)
val model = ZeticMLangeLLMModel(
    context = context,
    personalKey = PERSONAL_KEY,
    name = MODEL_NAME,
    modelMode = LLMModelMode.RUN_AUTO,        // automatic selection strategy
    initOption = LLMInitOption(nCtx = 4096),  // requested context size
)

// iOS (Swift)
let model = try await ZeticMLangeLLMModel(
  personalKey: PERSONAL_KEY,
  name: MODEL_NAME,
  modelMode: .RUN_AUTO,                    // automatic selection strategy
  initOption: LLMInitOption(nCtx: 4096)    // requested context size
)

// Flutter (Dart)
final model = await ZeticMLangeLLMModel.create(
  personalKey: personalKey,
  name: modelName,
  modelMode: LLMModelMode.runAuto,               // automatic selection strategy
  initOption: const LLMInitOption(nCtx: 4096),   // requested context size
);

Generate Text

run(...) starts generation. Read tokens with waitForNextToken() until the stream is finished.

// Android (Kotlin)
model.run("What is on-device AI?")

// Collect streamed tokens until the final result or an empty token arrives.
val output = StringBuilder()
while (true) {
    val next = model.waitForNextToken()
    if (next.isFinal || next.token.isEmpty()) break
    output.append(next.token)
}

// iOS (Swift)
try model.run("What is on-device AI?")

// Collect streamed tokens until the result reports that generation finished.
var output = ""
while true {
  let next = model.waitForNextToken()
  if next.isFinished { break }
  output.append(next.token)
}

// Flutter (Dart)
model.run('What is on-device AI?');

// Collect streamed tokens until the result reports that generation finished.
final output = StringBuffer();
while (true) {
  final next = model.waitForNextToken();
  if (next.isFinished) break;
  output.write(next.token);
}
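
The read loop can be wrapped in a small helper that also caps the output length, so a generation that never signals completion cannot loop forever. The Kotlin sketch below does not call the SDK: TokenResult, TokenSource, FakeSource, and collectTokens are hypothetical names used only for illustration, with TokenSource.next() standing in for waitForNextToken(). The completion field on the real result type may be named differently on each platform.

```kotlin
// Hypothetical stand-in for the value returned by waitForNextToken().
data class TokenResult(val token: String, val isFinished: Boolean)

// Hypothetical stand-in for a model that is already generating.
fun interface TokenSource {
    fun next(): TokenResult
}

// Drains the source until it reports completion or maxTokens is reached.
// onToken is invoked once per token, e.g. to update a UI incrementally.
fun collectTokens(
    source: TokenSource,
    maxTokens: Int = 512,
    onToken: (String) -> Unit = {},
): String {
    val output = StringBuilder()
    var count = 0
    while (count < maxTokens) {
        val next = source.next()
        if (next.isFinished) break
        output.append(next.token)
        onToken(next.token)
        count++
    }
    return output.toString()
}

// Fake source that replays a fixed token list, then reports completion.
class FakeSource(private val tokens: List<String>) : TokenSource {
    private var index = 0
    override fun next(): TokenResult =
        if (index < tokens.size) TokenResult(tokens[index++], false)
        else TokenResult("", true)
}

fun main() {
    val text = collectTokens(FakeSource(listOf("On", "-device", " AI")))
    println(text) // On-device AI
}
```

In an app, a thin adapter around the loaded model would replace FakeSource. The cap of 512 is an arbitrary illustrative default, not an SDK limit.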

Selection Controls

Start with modelMode alone. Add apType or quantType only when you need to narrow automatic backend selection.

Option	Purpose
`modelMode`	Selects automatic strategy: auto, speed, or accuracy.
`apType`	Filters by processor type such as CPU, GPU, or NPU when supported.
`quantType`	Filters by supported LLM quantization type.
`initOption.nCtx`	Requests the context size. The runtime may normalize it.
`cacheHandlingPolicy`	Controls downloaded model artifact cleanup on disk.

apType and quantType are filters. If the requested combination is not available for the selected model/device, initialization can fail.
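
Because a filtered request can fail at initialization, one option is to try the most specific configuration first and fall back to modelMode alone. A minimal Kotlin sketch under stated assumptions: LoadAttempt and loadWithFallback are hypothetical helpers, the loader lambdas stand in for ZeticMLangeLLMModel constructor calls, and failure is assumed to surface as a thrown exception, which this page does not confirm.

```kotlin
// Hypothetical: one labeled way of constructing a model.
class LoadAttempt<T>(val label: String, val load: () -> T)

// Tries each attempt in order and returns the first result that loads,
// paired with its label. If every attempt fails, throws
// IllegalStateException with the last failure as its cause.
fun <T> loadWithFallback(attempts: List<LoadAttempt<T>>): Pair<String, T> {
    require(attempts.isNotEmpty()) { "at least one attempt is required" }
    var lastError: Exception? = null
    for (attempt in attempts) {
        try {
            return attempt.label to attempt.load()
        } catch (e: Exception) {
            lastError = e // remember why this configuration was rejected
        }
    }
    throw IllegalStateException("no configuration could be loaded", lastError)
}

fun main() {
    val (label, model) = loadWithFallback(
        listOf(
            // Stand-in for a constructor call with apType/quantType filters.
            LoadAttempt<String>("filtered") {
                throw IllegalArgumentException("combination unavailable")
            },
            // Stand-in for a constructor call with modelMode only.
            LoadAttempt("auto") { "model-loaded-with-auto" },
        )
    )
    println("$label: $model")
}
```

Returning the label alongside the model lets the caller log which configuration was actually used on a given device.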
