ZETIC.MLange

LLM Inference Engine

Immediate on-device LLM implementation via Unified HAL

Instant On-Device LLM

Implementing efficient LLMs on mobile devices involves handling complex memory management, tensor offloading, and backend fragmentation.

We are continuously updating our support list. New architectures are validated weekly.

For the most up-to-date capabilities and comprehensive use cases, please visit: MLange Dashboard

Implementation Pipeline

Model Artifact Selection

Compatible inputs for the MLange LLM Engine include:

  1. Pre-built Models:
    Select a ready-to-use model from our dashboard: MLange Dashboard.

  2. Hugging Face Repository ID:
    e.g., google/gemma-3-4b-it or LiquidAI/LFM2.5-1.2B-Instruct (see the sketch after this list).

    Currently supports public repositories with permissive open-source licenses. Private repository authentication is on the roadmap.
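
For illustration, here is a minimal sketch of referencing a selected artifact from client code. It assumes the Hugging Face repository ID (or the key of a pre-built model chosen on the dashboard) can be used directly as the modelKey once the model has been provisioned; the exact key format is defined by your dashboard project.

// Minimal sketch (assumption: the Hugging Face repository ID doubles as the modelKey
// after the model has been provisioned through the MLange Dashboard)
val modelKey = "google/gemma-3-4b-it"

val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, modelKey, LLMModelMode.RUN_REGULAR)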

Instant Provisioning

Skip the complex setup.

  1. Initialize your workspace via the Web Dashboard.
  2. Generate a Personal Key to authenticate your client.

That's it. Our backend handles the model distribution and optimization.

Runtime Initialization & Token Streaming

Initialize the engine and start streaming tokens immediately.

Refer to the Android Integration Guide.

val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, modelKey, LLMModelMode.RUN_REGULAR)

// Start a generation context with the prompt
model.run("prompt")

// Consume tokens as they are generated; an empty string signals completion
while (true) {
    val token = model.waitForNextToken()

    if (token.isEmpty()) break

    // Append the token to the stream buffer
}

Refer to the iOS Integration Guide.

let model = ZeticMLangeLLMModel(PERSONAL_KEY, modelKey, .RUN_REGULAR)

// Start a generation context with the prompt
model.run("prompt")

// Consume tokens as they are generated; an empty string signals completion
while true {
    let token = model.waitForNextToken()

    if token.isEmpty {
        break
    }
    // Append the token to the stream buffer
}

API Reference

Initialization

Option 1: Automatic configuration (Recommended)

init(personalKey: String, modelKey: String)

Automatically downloads and initializes the model with default settings optimized for the device.

Parameters:

  • personalKey: Your personal API key
  • modelKey: Identifier for the model to download
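
A minimal sketch of the automatic configuration, assuming the Kotlin constructor mirrors this documented signature (on Android a Context is also passed, as in the streaming example above); the key values are placeholders:

// Minimal sketch: automatic configuration with device-optimized default settings
// (assumption: the constructor mirrors the documented init signature, plus an Android Context)
val model = ZeticMLangeLLMModel(context, "your_personal_key", "model_key")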

Option 2: Custom configuration (Advanced)

init(
    personalKey: String, 
    modelKey: String, 
    modelMode: LLMModelMode, 
    dataSetType: LLMDataSetType,
    kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
    onProgress: ((Float) -> Unit)? = null
)

Downloads and initializes the model with custom configuration for fine-tuned control.

Parameters:

  • personalKey: Your personal API key
  • modelKey: Identifier for the model to download
  • modelMode: (Optional) LLM inference mode for device-appropriate backend selection
  • dataSetType: (Optional) Type of dataset to use for the model
  • kvCacheCleanupPolicy: (Optional) Policy for handling KV cache when full. Defaults to CLEAN_UP_ON_FULL
    • CLEAN_UP_ON_FULL: Clears the entire context when KV cache is full

    • DO_NOT_CLEAN_UP: Keeps the context without cleanup when KV cache is full

      Calling run() again without first calling cleanUp() may cause unexpected behavior or bugs.

  • onProgress: (Optional) Callback function that reports model download progress as a Float value (0.0 to 1.0)

Example:

// Monitor download progress
init(
    personalKey = "your_key",
    modelKey = "model_key",
    modelMode = LLMModelMode.RUN_REGULAR,
    dataSetType = LLMDataSetType.DEFAULT,
    onProgress = { progress ->
        println("Download progress: ${(progress * 100).toInt()}%")
    }
)

For more information about mode selection, see the LLM Inference Mode Select page.

Context Management

Initiate Generation Context

run(prompt: String)

Starts a conversation with the provided prompt.

Consume Next Token

waitForNextToken(): String

Returns the next generated token. Empty string indicates completion.

Clean the Context

cleanUp()

Cleans up the context of the running model.
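
Putting these calls together, here is a minimal sketch of one complete generation cycle, ending with a context reset before the next, unrelated prompt (the response buffer is illustrative):

// One generation cycle using the context-management API documented above
model.run("Summarize on-device inference in one sentence.")

val response = StringBuilder()
while (true) {
    val token = model.waitForNextToken()
    if (token.isEmpty()) break   // empty string indicates completion
    response.append(token)
}

// Reset the context before starting a new conversation
model.cleanUp()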

Integrate ZETIC.MLange LLM into your project

Quick Start Templates

Build a complete chat app with just your PERSONAL_KEY and PROJECT_NAME. Check each repository's README for detailed instructions.

(Coming Soon) Unified Backend HAL

The Unified Hardware Abstraction Layer is currently in active development and will be released in the near future.

ZETIC.MLange abstracts this entire complexity (memory management, tensor offloading, and backend fragmentation) into a robust Unified Hardware Abstraction Layer (HAL).

We orchestrate the heavy lifting (memory mapping, KV-cache lifecycle, and NPU offloading) so you can implement SOTA LLMs immediately with a clear, simple API.

  • Multi-Backend Orchestration: Seamlessly abstracts LLaMA.cpp and vendor-specific NPU runtimes.
  • Automated Lifecycle Management: Zero manual memory management required.
  • Consistent Interface: Write once, run on any accelerated backend.