LLM Inference Overview
Run large language models on-device with ZETIC Melange.
Melange provides instant on-device LLM deployment, abstracting the complexity of memory management, tensor offloading, and backend fragmentation into a simple API.
How It Works
- Select a model: Choose from pre-built models on the Melange Dashboard or use a Hugging Face Repository ID.
- Initialize: Set up your workspace via the Dashboard and generate a Personal Key.
- Stream tokens: Initialize the LLM engine and start streaming tokens in your app.
Supported Input Sources
- Pre-built Models: Select a ready-to-use model from the Melange Dashboard.
- Hugging Face Repository ID: Use models like `google/gemma-3-4b-it` or `LiquidAI/LFM2.5-1.2B-Instruct`.
Melange currently supports public repositories with permissive open-source licenses; private repository authentication is on the roadmap.
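A repository ID is the `owner/model` path shown on the model's Hugging Face page. As an illustration, a hypothetical client-side sanity check (not part of the Melange SDK) could validate that shape before passing the string to the model constructor:

```kotlin
// Hypothetical helper (not part of the Melange SDK): checks that a string
// looks like a Hugging Face repository ID of the form "owner/model".
fun looksLikeHuggingFaceRepoId(id: String): Boolean {
    val parts = id.split("/")
    return parts.size == 2 && parts.all { it.isNotBlank() }
}

fun main() {
    println(looksLikeHuggingFaceRepoId("google/gemma-3-4b-it"))  // true
    println(looksLikeHuggingFaceRepoId("not-a-repo-id"))         // false
}
```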
Quick Example
Kotlin:

```kotlin
val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, MODEL_NAME)

// Start generation
model.run("What is on-device AI?")

// Stream tokens until generation completes
val sb = StringBuilder()
while (true) {
    val result = model.waitForNextToken()
    if (result.generatedTokens == 0) break
    sb.append(result.token)
}
val output = sb.toString()
```

Swift:

```swift
let model = try ZeticMLangeLLMModel(personalKey: PERSONAL_KEY, name: MODEL_NAME)

// Start generation
try model.run("What is on-device AI?")

// Stream tokens until generation completes
var buffer = ""
while true {
    let result = model.waitForNextToken()
    if result.generatedTokens == 0 { break }
    buffer.append(result.token)
}
let output = buffer
```

Quick Start Templates
Build a complete chat app with just your `PERSONAL_KEY` and `MODEL_NAME`. Check each repository's README for detailed setup instructions.
API Reference
Initialization
Automatic configuration (Recommended):
`init(personalKey: String, name: String)`

Automatically downloads and initializes the model with default settings optimized for the device.
Custom configuration (Advanced):
```kotlin
init(
    personalKey: String,
    name: String,
    version: String? = null,
    modelMode: LLMModelMode,
    dataSetType: LLMDataSetType,
    kvCacheCleanupPolicy: LLMKVCacheCleanupPolicy = CLEAN_UP_ON_FULL,
    onProgress: ((Float) -> Unit)? = null
)
```

Context Management
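The `onProgress` callback receives download progress as a `Float`. The sketch below shows only the callback pattern, with a stub downloader standing in for the SDK's internal model download (the chunked loop and its class are illustrative assumptions, not SDK behavior):

```kotlin
// Stand-in for the SDK's internal model download; only the
// ((Float) -> Unit)? callback shape matches the initializer above.
class FakeDownloader(private val totalChunks: Int) {
    fun download(onProgress: ((Float) -> Unit)?) {
        for (chunk in 1..totalChunks) {
            // ... fetch chunk ...
            onProgress?.invoke(chunk.toFloat() / totalChunks)
        }
    }
}

fun main() {
    val reported = mutableListOf<Float>()
    FakeDownloader(totalChunks = 4).download { progress -> reported.add(progress) }
    println(reported)  // [0.25, 0.5, 0.75, 1.0]
}
```

In an app, the callback would typically drive a loading indicator on the UI thread.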
| Method | Description |
|---|---|
| `run(prompt)` | Starts a conversation with the provided prompt. Returns `LLMRunResult`. |
| `waitForNextToken()` | Returns the next generated token. An empty string indicates completion. |
| `cleanUp()` | Cleans up the context of the running model. |
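These methods compose into the streaming loop from the Quick Example. Since the real model needs a device and a Personal Key, the sketch below substitutes a stub that mimics the `waitForNextToken()` contract (the stub class and its canned tokens are assumptions for illustration only):

```kotlin
// Stub mirroring the waitForNextToken() contract: generatedTokens drops
// to 0 (with an empty token) once generation finishes.
data class TokenResult(val token: String, val generatedTokens: Int)

class StubLLMModel(private val tokens: List<String>) {
    private var i = 0
    fun run(prompt: String) { i = 0 }        // real SDK starts generation here
    fun waitForNextToken(): TokenResult =
        if (i < tokens.size) TokenResult(tokens[i++], 1) else TokenResult("", 0)
    fun cleanUp() { i = 0 }                  // real SDK releases the model context
}

fun main() {
    val model = StubLLMModel(listOf("On-device ", "AI ", "runs ", "locally."))
    model.run("What is on-device AI?")
    val sb = StringBuilder()
    while (true) {
        val result = model.waitForNextToken()
        if (result.generatedTokens == 0) break
        sb.append(result.token)
    }
    model.cleanUp()
    println(sb.toString())  // On-device AI runs locally.
}
```

Calling `cleanUp()` between conversations resets the context so the next `run(prompt)` starts fresh.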
Next Steps
- Streaming Token Generation: Detailed streaming implementation
- Supported LLM Models: Available models
- LLM Inference Modes: Speed vs. accuracy configuration