LLM Inference Modes
Configure LLM inference modes for speed or accuracy with ZETIC Melange.
Melange provides inference modes for LLM models so you can balance speed against accuracy based on your application's requirements.
Available Modes
Speed (Available)
Minimizes latency by selecting the most aggressive quantization. Recommended for real-time applications where response time is the top priority and a slight accuracy trade-off is acceptable.
Auto (Paused)
Note: The performance metric for LLM accuracy is being updated. New models are currently available in Speed mode only.
Intelligently balances speed and accuracy for optimal performance. This mode evaluates quantized models across multiple benchmark datasets (MMLU, TruthfulQA, CNN/DailyMail, GSM8K) and selects the fastest model that maintains accuracy within acceptable thresholds:
- Ensures the original model scores are meaningful (> 0.2) before comparison
- Allows maximum 15% accuracy drop compared to the original model
- Requires absolute score difference < 0.05 to be considered acceptable
- Prioritizes speed among models meeting accuracy requirements
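The rules above amount to a filter-then-rank step. The sketch below is a minimal illustration only: the `Candidate` type, its fields, and the exact way the two accuracy thresholds combine are assumptions, not the actual Melange internals.

```kotlin
// Hypothetical sketch of Auto mode's candidate selection.
// `Candidate` and its fields are illustrative, not the Melange API.
data class Candidate(val name: String, val score: Double, val tokensPerSec: Double)

fun selectAuto(originalScore: Double, candidates: List<Candidate>): Candidate? {
    // The original model's score must be meaningful (> 0.2) before comparison
    if (originalScore <= 0.2) return null
    return candidates
        .filter { c ->
            val drop = originalScore - c.score
            // One reading of the thresholds: at most a 15% relative drop,
            // and an absolute drop below 0.05
            drop / originalScore <= 0.15 && drop < 0.05
        }
        .maxByOrNull { it.tokensPerSec } // fastest among acceptable candidates
}
```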
Accurate (Paused)
Delivers the highest precision by intelligently selecting the optimal quantization.
Without Dataset Specification (LLMDataSetType = null):
- Prioritizes less quantized models (higher bit-width)
- Ensures maximum fidelity to the original model
With Dataset Specification (LLMDataSetType provided):
- Selects the quantization type with the highest score on the specified benchmark dataset
- Available datasets: MMLU, TRUTHFULQA, CNN_DAILYMAIL, GSM8K
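The two selection paths can be sketched as below. The `Quantization` type and its fields are hypothetical names for illustration, not the Melange internals.

```kotlin
// Hypothetical sketch of Accurate mode's two selection paths.
data class Quantization(
    val name: String,
    val bitWidth: Int,
    val benchmarkScores: Map<String, Double> // e.g. "MMLU" -> 0.62
)

fun selectAccurate(candidates: List<Quantization>, dataset: String? = null): Quantization? =
    if (dataset == null) {
        // No dataset: prefer the least-quantized model (highest bit-width)
        candidates.maxByOrNull { it.bitWidth }
    } else {
        // Dataset given: prefer the highest score on that benchmark
        candidates.maxByOrNull { it.benchmarkScores[dataset] ?: Double.NEGATIVE_INFINITY }
    }
```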
API Usage
Android (Kotlin):

```kotlin
// Speed First Mode
val modelFast = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_SPEED
)

// Auto Mode (Paused)
val modelAuto = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_AUTO
)

// Accuracy First Mode (Paused)
val modelAccurate = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_ACCURACY
)

// Accuracy First with Dataset (Paused)
val modelAccurateMmlu = ZeticMLangeLLMModel(
    CONTEXT,
    PERSONAL_KEY,
    MODEL_NAME,
    VERSION,
    LLMModelMode.RUN_ACCURACY,
    LLMDataSetType.MMLU // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)
```

iOS (Swift):

```swift
// Speed First Mode
let modelFast = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_SPEED
)

// Auto Mode (Paused)
let modelAuto = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_AUTO
)

// Accuracy First Mode (Paused)
let modelAccurate = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_ACCURACY
)

// Accuracy First with Dataset (Paused)
let modelAccurateMmlu = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: MODEL_NAME,
    version: VERSION,
    modelMode: RUN_ACCURACY,
    dataSetType: MMLU // or TRUTHFULQA, CNN_DAILYMAIL, GSM8K
)
```

Benchmark-Based Selection
Melange automatically determines the optimal quantization based on comprehensive benchmark results across multiple datasets. The platform evaluates each quantized version against the original model and your target device capabilities to ensure the best balance of speed and accuracy.
Each mode uses multi-dataset evaluation to select the most suitable quantization for your specific requirements and hardware constraints.
Next Steps
- LLM Inference Overview: Getting started with LLM inference
- Supported LLM Models: Available models
- Inference Mode Selection (General): Modes for non-LLM models