LLM Inference
Streaming Token Generation
Stream LLM tokens in real time on Android and iOS with ZETIC Melange.
Melange's LLM engine supports streaming token generation, allowing you to display tokens to the user as they are generated rather than waiting for the complete response.
How Streaming Works
- Call `run(prompt)` to start the generation context.
- Call `waitForNextToken()` in a loop to receive tokens one at a time.
- When `generatedTokens` equals 0, generation is complete.
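The loop contract above can be exercised end-to-end without the SDK using a stand-in model. Note that `FakeLLMModel` and `WaitResult` below are illustrative stand-ins for the real Melange classes, not part of the Melange API:

```kotlin
// Stand-in for the Melange result type: a token plus the running token count.
data class WaitResult(val token: String, val generatedTokens: Int)

// Stand-in model that "generates" a fixed token list, then signals completion
// by returning generatedTokens == 0 -- the same termination condition the
// real engine uses.
class FakeLLMModel(private val tokens: List<String>) {
    private var index = 0

    fun run(prompt: String) {
        index = 0
    }

    fun waitForNextToken(): WaitResult =
        if (index < tokens.size) WaitResult(tokens[index++], index)
        else WaitResult("", 0) // generation complete

}

fun main() {
    val model = FakeLLMModel(listOf("Hello", ",", " world"))
    model.run("prompt")
    val sb = StringBuilder()
    while (true) {
        val result = model.waitForNextToken()
        if (result.generatedTokens == 0) break
        sb.append(result.token)
    }
    println(sb.toString()) // prints "Hello, world"
}
```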
Full Implementation
Kotlin:

```kotlin
val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, MODEL_NAME)

// Initiate generation context
model.run("prompt")

val sb = StringBuilder()

// Token consumption loop
while (true) {
    val waitResult = model.waitForNextToken()
    val token = waitResult.token
    val generatedTokens = waitResult.generatedTokens
    if (generatedTokens == 0) break
    // Append token to stream buffer
    if (token.isNotEmpty()) sb.append(token)
}
val output = sb.toString()
```

Swift:

```swift
let model = try ZeticMLangeLLMModel(personalKey: PERSONAL_KEY, name: MODEL_NAME)

// Initiate generation context
try model.run("prompt")

var buffer = ""

// Token consumption loop
while true {
    let waitResult = model.waitForNextToken()
    let token = waitResult.token
    let generatedTokens = waitResult.generatedTokens
    if generatedTokens == 0 {
        break
    }
    buffer.append(token)
}
let output = buffer
```

Streaming to the UI
For a responsive chat experience, update the UI with each token as it arrives:
Kotlin:

```kotlin
lifecycleScope.launch(Dispatchers.IO) {
    val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, MODEL_NAME)
    model.run(userPrompt)
    while (true) {
        val result = model.waitForNextToken()
        if (result.generatedTokens == 0) break
        withContext(Dispatchers.Main) {
            // Update UI with new token
            textView.append(result.token)
        }
    }
}
```

Swift:

```swift
DispatchQueue.global().async {
    do {
        let model = try ZeticMLangeLLMModel(personalKey: PERSONAL_KEY, name: MODEL_NAME)
        try model.run(userPrompt)
        while true {
            let result = model.waitForNextToken()
            if result.generatedTokens == 0 { break }
            DispatchQueue.main.async {
                // Update UI with new token
                self.textView.text?.append(result.token)
            }
        }
    } catch {
        print("LLM error: \(error)")
    }
}
```

Context Management
KV Cache Cleanup Policy
The LLM engine manages a KV cache for conversation context. You can configure how it handles a full cache:
- `CLEAN_UP_ON_FULL` (default): Clears the entire context when the KV cache is full.
- `DO_NOT_CLEAN_UP`: Keeps the context without cleanup when the KV cache is full.
When using `DO_NOT_CLEAN_UP`, calling `run()` again without calling `cleanUp()` may cause unexpected behavior.
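The difference between the two policies can be illustrated with a toy cache model. `KvCachePolicy` and `FakeKvCache` below are illustrative only, not the Melange API:

```kotlin
// Toy model of KV-cache overflow behavior -- illustrative, not the Melange API.
enum class KvCachePolicy { CLEAN_UP_ON_FULL, DO_NOT_CLEAN_UP }

class FakeKvCache(private val capacity: Int, private val policy: KvCachePolicy) {
    val entries = mutableListOf<String>()

    fun append(token: String) {
        if (entries.size >= capacity) {
            when (policy) {
                // Default: drop the whole context, then keep accepting tokens.
                KvCachePolicy.CLEAN_UP_ON_FULL -> entries.clear()
                // Keep the existing context; new tokens are not admitted
                // until the caller resets the context (cleanUp() in Melange).
                KvCachePolicy.DO_NOT_CLEAN_UP -> return
            }
        }
        entries.add(token)
    }

    fun cleanUp() = entries.clear()
}

fun main() {
    val a = FakeKvCache(2, KvCachePolicy.CLEAN_UP_ON_FULL)
    listOf("a", "b", "c").forEach(a::append)
    println(a.entries) // prints [c] -- context cleared on overflow

    val b = FakeKvCache(2, KvCachePolicy.DO_NOT_CLEAN_UP)
    listOf("a", "b", "c").forEach(b::append)
    println(b.entries) // prints [a, b] -- context kept, "c" not admitted
}
```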
Cleaning Up
Call `cleanUp()` to reset the conversation context:
```kotlin
model.cleanUp()

// Now you can start a new conversation
model.run("New prompt")
```

Next Steps
- Supported LLM Models: Available models for on-device inference
- LLM Inference Modes: Speed vs. accuracy configuration
- LLM Inference Overview: API reference details