
ZeticMLangeLLMModel

API reference for running LLM inference on iOS with ZeticMLangeLLMModel.

This page reflects ZeticMLange iOS 1.5.11.

ZeticMLangeLLMModel is the iOS entry point for on-device LLM inference. The current API has two initializer families:

  • Automatic selection by LLMModelMode
  • Explicit selection by LLMTarget, LLMQuantType, and APType

Use the automatic initializer first. Use the explicit initializer when you need a fixed GGUF quantization or want to force Apple CPU or GPU.

Import

import ZeticMLange

Initializers

Automatic Selection

This initializer selects the runtime and quantization automatically from model metadata.

init(
    personalKey: String,
    name: String,
    version: Int? = nil,
    modelMode: LLMModelMode = .RUN_AUTO,
    dataSetType: LLMDataSetType? = nil,
    cacheHandlingPolicy: ZeticMLangeCacheHandlingPolicy = .REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
    onDownload: ((Float) -> Void)? = nil
) throws
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| personalKey | String | - | Personal key from the Melange Dashboard. |
| name | String | - | Pre-built model key or Hugging Face repository ID. |
| version | Int? | nil | Model version. nil loads the latest version. |
| modelMode | LLMModelMode | .RUN_AUTO | Automatic selection strategy. |
| dataSetType | LLMDataSetType? | nil | Optional dataset hint for accuracy-oriented selection. |
| cacheHandlingPolicy | ZeticMLangeCacheHandlingPolicy | .REMOVE_OVERLAPPING | Managed artifact cache policy. |
| initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length. |
| onDownload | ((Float) -> Void)? | nil | Download progress callback from 0.0 to 1.0. |

Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.

let model = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: "google/gemma-3-4b-it",
    modelMode: .RUN_AUTO,
    initOption: LLMInitOption(
        kvCacheCleanupPolicy: .CLEAN_UP_ON_FULL,
        nCtx: 4096
    )
)

Automatic selection also accepts initOption. If you need to force Apple GPU, switch to the explicit initializer below because apType is not configurable in this path.

Explicit Runtime Selection

Use this initializer when you want to choose the runtime family, GGUF quantization, and processor type directly.

init(
    personalKey: String,
    name: String,
    version: Int? = nil,
    target: LLMTarget,
    quantType: LLMQuantType,
    apType: APType = .CPU,
    cacheHandlingPolicy: ZeticMLangeCacheHandlingPolicy = .REMOVE_OVERLAPPING,
    initOption: LLMInitOption = LLMInitOption(),
    onDownload: ((Float) -> Void)? = nil
) throws
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| personalKey | String | - | Personal key from the Melange Dashboard. |
| name | String | - | Pre-built model key or Hugging Face repository ID. |
| version | Int? | nil | Model version. nil loads the latest version. |
| target | LLMTarget | - | Runtime family to load. Use .LLAMA_CPP. |
| quantType | LLMQuantType | - | GGUF quantization to load. |
| apType | APType | .CPU | Processor type for the selected runtime. |
| cacheHandlingPolicy | ZeticMLangeCacheHandlingPolicy | .REMOVE_OVERLAPPING | Managed artifact cache policy. |
| initOption | LLMInitOption | LLMInitOption() | LLM initialization options such as KV-cache cleanup and requested context length. |
| onDownload | ((Float) -> Void)? | nil | Download progress callback from 0.0 to 1.0. |

Detailed cacheHandlingPolicy behavior and ModelCacheManager usage are currently TBD. See Cache Management.

let model = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: "google/gemma-3-4b-it",
    target: .LLAMA_CPP,
    quantType: .GGUF_QUANT_Q4_K_M,
    apType: .GPU,
    initOption: LLMInitOption(
        kvCacheCleanupPolicy: .DO_NOT_CLEAN_UP,
        nCtx: 4096
    )
)

initOption

initOption carries the LLM runtime initialization settings.

public struct LLMInitOption {
    public let kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy
    public let nCtx: Int
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| kvCacheCleanupPolicy | LLMKVCacheCleanupPolicy | .CLEAN_UP_ON_FULL | Conversation KV-cache policy. |
| nCtx | Int | 2048 | Requested context length. |

cacheHandlingPolicy and initOption.kvCacheCleanupPolicy are different settings. cacheHandlingPolicy controls downloaded model artifacts on disk. kvCacheCleanupPolicy controls the in-memory conversation KV cache during generation.

More detailed managed cache behavior is documented as TBD in Cache Management.

nCtx is a requested value, not an exact guarantee. The runtime can normalize it internally depending on the model, backend, or device.

apType Support

apType is relevant when you use the explicit initializer and choose target = .LLAMA_CPP.

| Device / runtime | Supported apType |
| --- | --- |
| Apple + .LLAMA_CPP | .CPU, .GPU |

Apple LLaMA.cpp does not support .NPU. Use .CPU or .GPU.
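Because .NPU is unavailable on this path, a common pattern is to attempt GPU first and fall back to CPU if initialization fails. The sketch below uses only the explicit initializer documented above; the assumption that an unsupported or failed GPU setup surfaces as a thrown error (rather than a silent fallback) is ours, not stated by the SDK.

```swift
import ZeticMLange

// Sketch (assumption: GPU init failure is reported by throwing).
// personalKey and name are placeholders supplied by the caller.
func makeLlamaCppModel(personalKey: String, name: String) throws -> ZeticMLangeLLMModel {
    do {
        // Prefer the Apple GPU path.
        return try ZeticMLangeLLMModel(
            personalKey: personalKey,
            name: name,
            target: .LLAMA_CPP,
            quantType: .GGUF_QUANT_Q4_K_M,
            apType: .GPU
        )
    } catch {
        // Retry on CPU with the same quantization.
        return try ZeticMLangeLLMModel(
            personalKey: personalKey,
            name: name,
            target: .LLAMA_CPP,
            quantType: .GGUF_QUANT_Q4_K_M,
            apType: .CPU
        )
    }
}
```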

Methods

run(_:)

Starts generation for a prompt.

func run(_ text: String) throws -> LLMRunResult
| Parameter | Type | Description |
| --- | --- | --- |
| text | String | Prompt text to start generation with. |

Returns: LLMRunResult

| Property | Type | Description |
| --- | --- | --- |
| promptTokens | Int | Number of prompt tokens consumed. |
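For example, promptTokens can be used to sanity-check a prompt against the context length you requested via nCtx (a sketch; exact token accounting varies by model and tokenizer):

```swift
let result = try model.run("Summarize this document in three sentences.")
// promptTokens reports how many context tokens the prompt consumed.
print("Prompt consumed \(result.promptTokens) of the 4096 requested context tokens")
```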

waitForNextToken()

Blocks until the next token is available.

func waitForNextToken() -> LLMNextTokenResult

Returns: LLMNextTokenResult

| Property | Type | Description |
| --- | --- | --- |
| token | String | Generated token text. |
| generatedTokens | Int | Number of generated tokens so far. |
| code | Int | Native status code. |

cleanUp()

Resets the current conversation state without destroying the model instance.

func cleanUp() throws

If you use .DO_NOT_CLEAN_UP, call cleanUp() before starting the next conversation.
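A multi-conversation flow under .DO_NOT_CLEAN_UP can be sketched as follows (token draining is elided; see the Full Examples for the complete loop):

```swift
// Sketch: reuse one model instance across conversations when
// kvCacheCleanupPolicy is .DO_NOT_CLEAN_UP.
try model.run("First conversation prompt")
// ... drain tokens with waitForNextToken() ...

// Reset the in-memory KV cache before the next conversation.
try model.cleanUp()

try model.run("Second conversation prompt")
// ... drain tokens again ...
```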

forceDeinit()

Fully releases the underlying target model.

func forceDeinit()

Full Examples

Automatic Selection

let model = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: "google/gemma-3-4b-it",
    modelMode: .RUN_AUTO,
    initOption: LLMInitOption(
        kvCacheCleanupPolicy: .CLEAN_UP_ON_FULL,
        nCtx: 4096
    )
)

try model.run("Explain on-device AI in one paragraph.")

var output = ""
while true {
    let result = model.waitForNextToken()
    if result.generatedTokens == 0 { break }
    output.append(result.token)
}

try model.cleanUp()
model.forceDeinit()

Explicit Apple GPU Selection

let model = try ZeticMLangeLLMModel(
    personalKey: PERSONAL_KEY,
    name: "google/gemma-3-4b-it",
    target: .LLAMA_CPP,
    quantType: .GGUF_QUANT_Q4_K_M,
    apType: .GPU,
    initOption: LLMInitOption(
        kvCacheCleanupPolicy: .DO_NOT_CLEAN_UP,
        nCtx: 4096
    )
)

Notes

The initializer can download model artifacts on first use. Create the model off the main thread if you want to avoid blocking the UI.
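With Swift concurrency, that can look like the sketch below (error handling and app-state wiring elided; PERSONAL_KEY is a placeholder):

```swift
// Sketch: construct the model off the main actor so a first-run
// download does not block the UI.
Task.detached(priority: .userInitiated) {
    let model = try ZeticMLangeLLMModel(
        personalKey: PERSONAL_KEY,
        name: "google/gemma-3-4b-it",
        onDownload: { progress in
            // Hop back to the main actor for any UI updates.
            Task { @MainActor in
                print("Download: \(Int(progress * 100))%")
            }
        }
    )
    // Hand the model back to your app state on the main actor as needed.
    _ = model
}
```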
