LLM Inference Engine
Immediate on-device LLM implementation via Unified HAL
Instant On-Device LLM
Implementing efficient LLMs on mobile devices involves handling complex memory management, tensor offloading, and backend fragmentation.
We are continuously updating our support list. New architectures are validated weekly.
For the most up-to-date capabilities and comprehensive use cases, please visit: MLange Dashboard
Implementation Pipeline
Model Artifact Selection
Compatible inputs for the MLange LLM Engine include:
- Pre-built Models: Select a ready-to-use model from our dashboard: MLange Dashboard.
- Hugging Face Repository ID: e.g., google/gemma-3-4b-it or LiquidAI/LFM2.5-1.2B-Instruct (see the sketch below). Currently supports public repositories with permissive open-source licenses. Private repository authentication is on the roadmap.
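For illustration, the sketch below assumes a public Hugging Face repository ID can be passed in place of the model key when constructing the engine on Android; check your dashboard for the exact key your project exposes.
// Hypothetical: use a public Hugging Face repository ID as the model key.
// PERSONAL_KEY is the key generated from the dashboard (see Instant Provisioning below).
val model = ZeticMLangeLLMModel(
    context,
    PERSONAL_KEY,
    "LiquidAI/LFM2.5-1.2B-Instruct",
    LLMModelMode.RUN_REGULAR
)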
Instant Provisioning
Skip the complex setup.
- Initialize your workspace via the Web Dashboard.
- Generate a Personal Key to authenticate your client.
That's it. Our backend handles model distribution and optimization.
Runtime Initialization & Token Streaming
Initialize the engine and start streaming tokens immediately.
Refer to the Android Integration Guide.
val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, modelKey, LLMModelMode.RUN_REGULAR)
// Initiate generation context
model.run("prompt")
// Asynchronous Token Consumption Loop
while (true) {
val token = model.waitForNextToken()
if (token == "") break
// Append token to stream buffer
}
Refer to the iOS Integration Guide.
let model = ZeticMLangeLLMModel(PERSONAL_KEY, modelKey, .RUN_REGULAR)
// Initiate generation context
model.run("prompt")
while true {
let token = model.waitForNextToken()
if token == "" {
break
}
    // Append token to stream buffer
}
API Reference
Initialization
Option 1: Automatic configuration (Recommended)
init(personalKey: String, modelKey: String)
Automatically downloads and initializes the model with default settings optimized for the device.
Parameters:
- personalKey: Your personal API key
- modelKey: Identifier for the model to download
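A minimal sketch of the automatic path in Kotlin, assuming the two-argument initializer above maps onto the Android model class used earlier (which also takes a Context as its first argument):
// Automatic configuration: downloads the model and applies device-appropriate defaults.
val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, modelKey)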
Option 2: Custom configuration (Advanced)
init(
personalKey: String,
modelKey: String,
modelMode: LLMModelMode,
dataSetType: LLMDataSetType,
kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy = LLMKVCacheCleanupPolicy.CLEAN_UP_ON_FULL,
onProgress: ((Float) -> Unit)? = null
)
Downloads and initializes the model with custom configuration for fine-grained control.
Parameters:
- personalKey: Your personal API key
- modelKey: Identifier for the model to download
- modelMode: (Optional) LLM inference mode for device-appropriate backend selection
- dataSetType: (Optional) Type of dataset to use for the model
- kvCacheCleanupPolicy: (Optional) Policy for handling the KV cache when it is full. Defaults to CLEAN_UP_ON_FULL.
  - CLEAN_UP_ON_FULL: Clears the entire context when the KV cache is full
  - DO_NOT_CLEAN_UP: Keeps the context without cleanup when the KV cache is full. Running run() again without calling cleanUp() may cause unexpected behavior or bugs.
- onProgress: (Optional) Callback function that reports model download progress as a Float value (0.0 to 1.0)
Example:
// Monitor download progress
init(
personalKey = "your_key",
modelKey = "model_key",
modelMode = LLMModelMode.RUN_REGULAR,
dataSetType = LLMDataSetType.DEFAULT,
onProgress = { progress ->
println("Download progress: ${(progress * 100).toInt()}%")
}
)
For more information about mode selection, please follow the LLM Inference Mode Select page.
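As a sketch of the DO_NOT_CLEAN_UP policy described above (the positional argument order follows the custom-configuration signature and is an assumption; verify it against the SDK):
// Keep the KV cache when it fills up; the caller manages the context lifetime.
val model = ZeticMLangeLLMModel(
    context,
    PERSONAL_KEY,
    modelKey,
    LLMModelMode.RUN_REGULAR,
    LLMDataSetType.DEFAULT,
    LLMKVCacheCleanupPolicy.DO_NOT_CLEAN_UP
)

// When a conversation finishes, clear the context explicitly
// before running an unrelated prompt (see cleanUp() below).
model.cleanUp()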
Context Management
Initiate Generation Context
run(prompt: String)
Starts a conversation with the provided prompt.
Consume Next Token
waitForNextToken(): String
Returns the next generated token. An empty string indicates completion.
Clean the context
cleanUp()
Cleans up the context of the running model.
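Putting the three calls together, a minimal end-to-end sketch in Kotlin (threading and UI updates omitted):
// Start a generation context with the user prompt.
model.run("Summarize on-device inference in one sentence.")

// Stream tokens until the engine signals completion with an empty string.
val response = StringBuilder()
while (true) {
    val token = model.waitForNextToken()
    if (token.isEmpty()) break
    response.append(token)
}

// Release the context before starting the next, unrelated conversation.
model.cleanUp()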
Integrate ZETIC.MLange LLM into your project
Quick Start Templates
Build a complete chat app with just your PERSONAL_KEY and PROJECT_NAME. Check each repository's README for detailed instructions.
(Coming Soon) Unified Backend HAL
The Unified Hardware Abstraction Layer is currently in active development and will be released in the near future.
ZETIC.MLange abstracts the complexity of on-device LLM deployment into a robust Unified Hardware Abstraction Layer (HAL).
We orchestrate the heavy lifting (memory mapping, KV-cache lifecycle, and NPU offloading) so you can implement SOTA LLMs immediately with a clear, simple API.
- Multi-Backend Orchestration: Seamlessly abstracts LLaMA.cpp and vendor-specific NPU runtimes.
- Automated Lifecycle Management: Zero manual memory management required.
- Consistent Interface: Write once, run on any accelerated backend.