
Streaming Token Generation

Stream LLM tokens in real-time on Android and iOS with ZETIC Melange.

Melange's LLM engine supports streaming token generation, allowing you to display tokens to the user as they are generated rather than waiting for the complete response.

How Streaming Works

  1. Call run(prompt) to start the generation context.
  2. Call waitForNextToken() in a loop to receive tokens one at a time.
  3. When the returned generatedTokens is 0, generation is complete; exit the loop.

Full Implementation

Kotlin:

val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, MODEL_NAME)

// Initiate generation context
model.run("prompt")

val sb = StringBuilder()

// Token consumption loop: wait for each token as it is generated
while (true) {
    val waitResult = model.waitForNextToken()
    val token = waitResult.token
    val generatedTokens = waitResult.generatedTokens

    if (generatedTokens == 0) break

    // Append token to stream buffer
    if (token.isNotEmpty()) sb.append(token)
}

val output = sb.toString()

Swift:

let model = try ZeticMLangeLLMModel(personalKey: PERSONAL_KEY, name: MODEL_NAME)

// Initiate generation context
try model.run("prompt")

var buffer = ""

// Token consumption loop: wait for each token as it is generated
while true {
    let waitResult = model.waitForNextToken()
    let token = waitResult.token
    let generatedTokens = waitResult.generatedTokens

    if generatedTokens == 0 {
        break
    }

    buffer.append(token)
}

let output = buffer

Streaming to the UI

For a responsive chat experience, update the UI with each token as it arrives:

Kotlin:

lifecycleScope.launch(Dispatchers.IO) {
    val model = ZeticMLangeLLMModel(context, PERSONAL_KEY, MODEL_NAME)
    model.run(userPrompt)

    while (true) {
        val result = model.waitForNextToken()
        if (result.generatedTokens == 0) break

        withContext(Dispatchers.Main) {
            // Update UI with new token
            textView.append(result.token)
        }
    }
}

Swift:

DispatchQueue.global().async {
    do {
        let model = try ZeticMLangeLLMModel(personalKey: PERSONAL_KEY, name: MODEL_NAME)
        try model.run(userPrompt)

        while true {
            let result = model.waitForNextToken()
            if result.generatedTokens == 0 { break }

            DispatchQueue.main.async {
                // Update UI with new token
                self.textView.text?.append(result.token)
            }
        }
    } catch {
        print("LLM error: \(error)")
    }
}
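On Android, the blocking poll loop can also be wrapped in a cold kotlinx.coroutines Flow, so tokens can be collected idiomatically from Compose or any coroutine scope. The sketch below is a hypothetical wrapper, not part of the Melange SDK: the TokenSource interface and TokenResult class are stand-ins mirroring the documented run()/waitForNextToken() contract, and the fake source in main() exists only to demonstrate the flow.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Hypothetical stand-ins mirroring the documented API surface:
// waitForNextToken() yields the next token plus a running count,
// with generatedTokens == 0 signalling end of generation.
data class TokenResult(val token: String, val generatedTokens: Int)

interface TokenSource {
    fun run(prompt: String)
    fun waitForNextToken(): TokenResult
}

// Wrap the blocking poll loop in a cold Flow that emits each token as it arrives.
fun TokenSource.tokenFlow(prompt: String): Flow<String> = flow {
    run(prompt)
    while (true) {
        val result = waitForNextToken()
        if (result.generatedTokens == 0) break
        if (result.token.isNotEmpty()) emit(result.token)
    }
}.flowOn(Dispatchers.IO)

fun main() = runBlocking {
    // Fake source that replays a fixed token stream, for demonstration only.
    val fake = object : TokenSource {
        private val tokens = ArrayDeque(listOf("Hello", ",", " world"))
        override fun run(prompt: String) {}
        override fun waitForNextToken(): TokenResult {
            val t = tokens.removeFirstOrNull() ?: return TokenResult("", 0)
            return TokenResult(t, tokens.size + 1)
        }
    }
    println(fake.tokenFlow("hi").toList().joinToString("")) // Hello, world
}
```

With a real model instance, collectors would simply call `tokenFlow(prompt).collect { textView.append(it) }`, keeping the blocking wait off the main thread via `flowOn(Dispatchers.IO)`.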

Context Management

KV Cache Cleanup Policy

The LLM engine manages a KV cache for conversation context. You can configure how it handles a full cache:

  • CLEAN_UP_ON_FULL (default): Clears the entire context when the KV cache is full.
  • DO_NOT_CLEAN_UP: Keeps the context without cleanup when the KV cache is full.

When using DO_NOT_CLEAN_UP, calling run() again without calling cleanUp() may cause unexpected behavior.
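The practical difference between the two policies can be illustrated with a toy model. This is only a sketch of the documented behavior, not the SDK implementation; every name below (ToyKVCache, append, and so on) is made up for illustration.

```kotlin
// Toy illustration of the two documented KV cache policies.
// Models observable behavior only; not the SDK implementation.
enum class KVCachePolicy { CLEAN_UP_ON_FULL, DO_NOT_CLEAN_UP }

class ToyKVCache(private val capacity: Int, private val policy: KVCachePolicy) {
    var used = 0
        private set

    // Try to reserve room for `n` new context tokens.
    // Returns true if the tokens fit (possibly after an automatic cleanup).
    fun append(n: Int): Boolean {
        if (used + n > capacity) {
            when (policy) {
                // Default: the whole context is dropped and generation starts fresh.
                KVCachePolicy.CLEAN_UP_ON_FULL -> used = 0
                // Context is kept as-is; the caller must clean up explicitly,
                // otherwise further appends keep failing.
                KVCachePolicy.DO_NOT_CLEAN_UP -> return false
            }
        }
        used += n
        return true
    }

    fun cleanUp() { used = 0 }
}

fun main() {
    val auto = ToyKVCache(capacity = 8, policy = KVCachePolicy.CLEAN_UP_ON_FULL)
    auto.append(6)
    println(auto.append(6)) // true: cache cleared automatically, new tokens fit
    println(auto.used)      // 6

    val manual = ToyKVCache(capacity = 8, policy = KVCachePolicy.DO_NOT_CLEAN_UP)
    manual.append(6)
    println(manual.append(6)) // false: cache full, context kept
    manual.cleanUp()
    println(manual.append(6)) // true after an explicit cleanUp()
}
```

The second half of main() mirrors the warning above: under DO_NOT_CLEAN_UP, an explicit cleanup must happen before the context can grow again.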

Cleaning Up

Call cleanUp() to reset the conversation context:

model.cleanUp()
// Now you can start a new conversation
model.run("New prompt")

Next Steps