Audio Understanding (Qwen2.5-Omni)
Run on-device audio understanding with Qwen2.5-Omni — feed audio embeddings directly into the LLM decoder via runWithEmbeddings.
This tutorial uses the Multimodal (Beta) APIs introduced in ZeticMLange 1.7.0-beta.1 (Android + iOS). The APIs are stable for the audio path covered here but may evolve before general availability.
Build an on-device audio understanding application using Alibaba's Qwen2.5-Omni 3B model. Unlike pure speech-to-text (transcription), Qwen2.5-Omni is a unified multimodal LLM that can describe, reason about, and answer questions about audio input — speech, music, ambient sounds, and more.
What You Will Build
A two-stage on-device pipeline:
mic input → mel spectrogram → audio encoder
→ flat embedding sequence
→ LLM decoder (runWithEmbeddings)
→ streaming text output

The audio encoder and LLM decoder are two separate models. On RAM-constrained devices (≤ 6 GB) they cannot coexist, so the application swaps them: load encoder → run encoder → unload encoder → load decoder → runWithEmbeddings() → stream tokens.
Prerequisites
- A ZETIC Melange account with a Personal Key (sign up at melange.zetic.ai)
- Android Studio or Xcode for mobile deployment
- A physical device with at least 6 GB RAM (iPhone 15 Pro or later for iOS; comparable Android device)
- Wi-Fi for first-run model downloads (encoder ≈ 2.4 GB + decoder ≈ 2 GB)
What is Qwen2.5-Omni?
Qwen2.5-Omni is Alibaba's unified multimodal LLM. The key distinction:
- Inputs: text, audio, image, video
- Outputs: text, speech
- Audio understanding, not just transcription: it answers questions like "what kind of music is this?", "how does the speaker feel?", or "list the sounds you hear" — capabilities a speech-to-text model cannot provide.
This tutorial uses the audio → text subset.
Architecture Overview
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Microphone  │ ─▶ │ Mel          │ ─▶ │ Audio        │ ─▶ │ LLM Decoder  │ ─▶ Text
│ 16 kHz mono │    │ Spectrogram  │    │ Encoder      │    │ (Qwen-Omni)  │
│ float32 PCM │    │ 128 × N      │    │ → embeddings │    │              │
└─────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Step-by-step:
- Audio capture — 16 kHz mono float32 PCM via the platform audio API.
- Mel spectrogram — 400-point DFT, 160-sample hop, 128 mel bands (Slaney scale, Whisper-compatible).
- Audio encoder — ZeticMLangeModel running zetic/qwen2.5_omni_audio_encoder_chunk_f16. 200 mel frames per chunk → [50, 2048] embeddings per chunk.
- Embedding injection — a ChatML chat template wraps the encoder output with <|audio_bos|> / <|audio_eos|> markers; the helper produces a flat embedding buffer suitable for runWithEmbeddings.
- LLM decoder — ZeticMLangeLLMModel running zetic/QWEN_2.5_omni_3b_decoder. The token stream is consumed via waitForNextToken.
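Concretely: at 16 kHz with a 160-sample hop, one second of audio produces 100 mel frames, so each 200-frame encoder chunk covers 2 seconds of audio and its 50 output embeddings work out to 25 audio tokens per second.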
Step 1: Platform Setup
For project scaffolding, follow the platform-specific setup guide first.
For the audio understanding pipeline, your dependency must be at least 1.7.0-beta.1:
// build.gradle.kts
dependencies {
implementation("com.zeticai.mlange:mlange:1.7.0-beta.1")
}

On iOS, use the 1.7.0-beta.1 ZeticMLange xcframework distribution. SPM and CocoaPods coordinates follow the platform integration guide.
Step 2: Mel Spectrogram Preprocessing
Qwen2.5-Omni's audio encoder expects log-mel features matching librosa's Slaney-scale filter bank — the same convention as Whisper. The repository ships a pre-computed filter bank as a binary asset (mel_filterbank.bin, shape [128, 201] row-major float32) generated with:
librosa.filters.mel(sr=16000, n_fft=400, n_mels=128, fmax=8000, norm='slaney', htk=False)

Bundle mel_filterbank.bin alongside your app:
Android: place the file under app/src/main/assets/mel_filterbank.bin and load it with context.assets.open(...).
iOS: add the file to the app bundle and resolve via Bundle.main.url(forResource: "mel_filterbank", withExtension: "bin").
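On the Android side, loading the raw filter bank can be sketched as below. This assumes the file is stored little-endian (the NumPy default on common hardware), and loadMelFilterBank is an illustrative helper name, not an SDK API:

import android.content.Context
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Load the [128, 201] row-major float32 filter bank from assets.
fun loadMelFilterBank(context: Context): FloatArray {
    val bytes = context.assets.open("mel_filterbank.bin").use { it.readBytes() }
    require(bytes.size == 128 * 201 * 4) { "unexpected filter bank size: ${bytes.size}" }
    val floats = FloatArray(128 * 201)
    // Assumption: little-endian on disk; flip the order if your asset differs.
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats)
    return floats
}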
The mel extractor runs a 400-point naive DFT with a periodic Hann window, applies the bundled filter bank, and finishes with Whisper's log-mel normalization:
mel[i] = max(log10(power_mel), max_log_mel - 8) / 4 + 1

A complete reference implementation in Kotlin and Swift is provided in the sample project linked at the end of this tutorial.
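Until then, here is a minimal Kotlin sketch of just that final normalization step (names are illustrative; powerMel is assumed to hold the filter-bank output as a flat float array):

import kotlin.math.log10
import kotlin.math.max

// Whisper-style log-mel normalization: floor, log10, clamp to (max - 8), rescale.
fun logMelNormalize(powerMel: FloatArray): FloatArray {
    val logMel = FloatArray(powerMel.size)
    var maxLog = Float.NEGATIVE_INFINITY
    for (i in powerMel.indices) {
        // The 1e-10 floor avoids log10(0) and matches Whisper's reference preprocessing.
        logMel[i] = log10(max(powerMel[i], 1e-10f))
        if (logMel[i] > maxLog) maxLog = logMel[i]
    }
    for (i in logMel.indices) {
        // mel[i] = max(log10(power_mel), max_log_mel - 8) / 4 + 1
        logMel[i] = max(logMel[i], maxLog - 8f) / 4f + 1f
    }
    return logMel
}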
Step 3: Audio Encoder
The audio encoder is a regular ZeticMLangeModel that consumes 200-frame mel chunks and emits 50 embedding tokens × 2048 dims per chunk. Concatenate per-chunk outputs and trim to the actual number of audio tokens.
Kotlin (Android):

import com.zeticai.mlange.core.model.ModelMode
import com.zeticai.mlange.core.model.ZeticMLangeModel
val encoder = ZeticMLangeModel(
context = context,
personalKey = PERSONAL_KEY,
name = "zetic/qwen2.5_omni_audio_encoder_chunk_f16",
modelMode = ModelMode.RUN_AUTO,
)
// For each 200-frame mel chunk:
// 1. Fill encoder.getInputBuffers()[0] with mel data ([1, 128, 200] float32)
// 2. Call encoder.run() and read encoder.outputs[0] as [1, 50, 2048] float32
// 3. Concatenate into a flat FloatArray sized [numTokens * 2048]

Swift (iOS):

import ZeticMLange
let encoder = try ZeticMLangeModel(
personalKey: PERSONAL_KEY,
name: "zetic/qwen2.5_omni_audio_encoder_chunk_f16",
target: .ZETIC_MLANGE_TARGET_COREML
)
// For each 200-frame mel chunk:
// 1. Fill the encoder input tensor with mel data ([1, 128, 200] float32)
// 2. Run inference
// 3. Concatenate the [1, 50, 2048] output into a flat [Float]

The encoder is large. After running it, unload it before loading the decoder if your device has ≤ 6 GB total RAM — otherwise the decoder load will OOM.
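The commented loop expands into roughly the following Kotlin sketch. The buffer handling is an assumption (getInputBuffers()[0] and outputs[0] are treated as direct float32 ByteBuffers); adapt it if the actual tensor API differs:

import com.zeticai.mlange.core.model.ZeticMLangeModel
import java.nio.ByteOrder

// Run the encoder chunk by chunk and concatenate the embeddings.
// `mel` is the normalized log-mel buffer, row-major [128 × numFrames].
fun encodeAudio(encoder: ZeticMLangeModel, mel: FloatArray, numFrames: Int): FloatArray {
    val numChunks = (numFrames + 199) / 200
    val embeddings = FloatArray(numChunks * 50 * 2048)
    for (c in 0 until numChunks) {
        // Assumption: the input tensor is exposed as a direct ByteBuffer.
        val input = encoder.getInputBuffers()[0].order(ByteOrder.nativeOrder())
        input.rewind()
        // Fill one [1, 128, 200] chunk, zero-padding past the end of the audio.
        for (m in 0 until 128) {
            for (f in 0 until 200) {
                val frame = c * 200 + f
                input.putFloat(if (frame < numFrames) mel[m * numFrames + frame] else 0f)
            }
        }
        encoder.run()
        // Assumption: the [1, 50, 2048] output is likewise a float32 ByteBuffer.
        val output = encoder.outputs[0].order(ByteOrder.nativeOrder())
        output.rewind()
        for (i in 0 until 50 * 2048) {
            embeddings[c * 50 * 2048 + i] = output.getFloat()
        }
    }
    return embeddings // trim to the actual audio-token count before injection
}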
Step 4: Chat Template + Embedding Injection
Qwen2.5-Omni expects audio embeddings to be wrapped in a Qwen ChatML prompt with <|audio_bos|> / <|audio_eos|> markers. The SDK ships a ready-to-use helper, QwenOmniAudioChatTemplate, that:
- Tokenizes the chat prefix and suffix (with parseSpecial = true so the markers become single tokens).
- Looks up per-token embeddings from the model's tok_embd tensor.
- Concatenates [prefix_embeds, audio_embeds, suffix_embeds] into one flat float buffer suitable for runWithEmbeddings.
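Because the audio embeddings are injected directly, the decoder's embedding width has to match the encoder's 2048-dim output, so the merged buffer is a flat array of (prefix_tokens + audio_tokens + suffix_tokens) × 2048 floats.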
Before passing the loaded LLM model to the helper, validate it against the audio profile so an incompatible checkpoint fails fast:
Kotlin (Android):

import com.zeticai.mlange.core.model.llm.LLMModelMode
import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel
import com.zeticai.mlange.core.model.multimodal.MultimodalProfile
import com.zeticai.mlange.core.model.multimodal.QwenOmniAudioChatTemplate
import com.zeticai.mlange.core.model.multimodal.validate
val decoder = ZeticMLangeLLMModel(
context = context,
personalKey = PERSONAL_KEY,
name = "zetic/QWEN_2.5_omni_3b_decoder",
modelMode = LLMModelMode.RUN_AUTO,
)
decoder.validate(MultimodalProfile.QWEN_OMNI_AUDIO)
val merged = QwenOmniAudioChatTemplate().build(
llm = decoder,
audioEmbeddings = audioEmbeddings, // FloatArray from Step 3
userText = "What do you hear in this audio?",
)
decoder.runWithEmbeddings(merged)

Swift (iOS):

import ZeticMLange
let decoder = try ZeticMLangeLLMModel(
personalKey: PERSONAL_KEY,
name: "zetic/QWEN_2.5_omni_3b_decoder"
)
try decoder.validate(profile: .qwenOmniAudio)
let merged = try QwenOmniAudioChatTemplate().build(
llm: decoder,
audioEmbeddings: audioEmbeddings, // [Float] from Step 3
userText: "What do you hear in this audio?"
)
_ = try decoder.runWithEmbeddings(merged)

runWithEmbeddings queues the merged embedding sequence as a single decode batch but does not sample any tokens itself — the decode + sampling happen in the streaming loop next.
Step 5: Streaming the Response
Token streaming uses the same waitForNextToken() you already use with run(text:). The first call decodes the embedding batch and samples the first response token; subsequent calls decode and sample one token at a time.
Kotlin (Android):

val sb = StringBuilder()
while (true) {
val result = decoder.waitForNextToken()
if (result.generatedTokens == 0) break
if (result.token.isNotEmpty()) sb.append(result.token)
}
val response = sb.toString()
decoder.cleanUp()

Swift (iOS):

var response = ""
while true {
let result = decoder.waitForNextToken()
if result.generatedTokens == 0 { break }
if !result.token.isEmpty { response += result.token }
}
try? decoder.cleanUp()

Step 6: Encoder ↔ Decoder Swap
The encoder + decoder pair does not fit in memory together on most current phones, so this tutorial recommends a swap pattern:
[app start] ← decoder cached on disk, encoder resident
[user starts recording] ← prewarm encoder (already resident, no-op)
[user stops recording] ← run encoder → unload encoder → load decoder
[stream tokens]
[response complete] ← unload decoder → reload encoder for next turn

A typical swap (encoder unload + decoder load from disk cache) takes 5–10 seconds on iPhone 15 Pro. Subsequent loads from cache are noticeably faster than the very first download/compile cycle.
Devices with 8 GB+ RAM (e.g. iPhone 16 Pro and recent flagship Android) can sometimes hold both models simultaneously and skip swap entirely — but this is not guaranteed across vendors and OS versions. Validate on your target device.
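For reference, one turn of the swap can be sketched as below. loadEncoder, loadDecoder, and streamTokens are hypothetical wrappers around the snippets from Steps 3–5, and the sketch assumes that dropping the encoder reference releases its native memory; if the SDK exposes an explicit unload call for ZeticMLangeModel, prefer that:

import com.zeticai.mlange.core.model.ZeticMLangeModel
import com.zeticai.mlange.core.model.llm.ZeticMLangeLLMModel
import com.zeticai.mlange.core.model.multimodal.QwenOmniAudioChatTemplate

// loadEncoder/loadDecoder/streamTokens: hypothetical helpers wrapping Steps 3–5.
var encoder: ZeticMLangeModel? = loadEncoder() // resident from app start

fun runTurn(mel: FloatArray, numFrames: Int, userText: String): String {
    // 1. Encode while the encoder is resident, then drop it before the decoder loads.
    val audioEmbeddings = encodeAudio(encoder!!, mel, numFrames)
    encoder = null // assumption: releasing the reference frees the encoder's memory

    // 2. Load the decoder (fast when served from the on-disk cache) and inject.
    val decoder: ZeticMLangeLLMModel = loadDecoder()
    val merged = QwenOmniAudioChatTemplate().build(
        llm = decoder,
        audioEmbeddings = audioEmbeddings,
        userText = userText,
    )
    decoder.runWithEmbeddings(merged)

    // 3. Stream the response, release the decoder, and bring the encoder back.
    val response = streamTokens(decoder) // the Step 5 loop
    decoder.cleanUp()
    encoder = loadEncoder()
    return response
}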
Beta Notes
- Backend: Multimodal embedding injection is currently supported only on the llama.cpp backend. Calling runWithEmbeddings / tokenize / tokenEmbeddings / specialTokenId on a non-llama.cpp target throws.
- Models: Tested with zetic/qwen2.5_omni_audio_encoder_chunk_f16 and zetic/QWEN_2.5_omni_3b_decoder. Other Qwen2.5-Omni-compatible checkpoints should work as long as their vocabulary contains the special tokens declared in MultimodalProfile.QWEN_OMNI_AUDIO.
- Output language: The 3B decoder may occasionally produce Chinese for English audio. This is a base-model trait; sampling and prompt tuning can help.
- Audio quality: Validated primarily with clear English speech and short music clips. Heavy background noise, very short clips (< 0.5 s), or non-Whisper sample rates may degrade output quality.
See Also
- LLM Inference: Multimodal (Beta): API concepts and design background.
- ZeticMLangeLLMModel — Android: Multimodal section and method signatures.
- ZeticMLangeLLMModel — iOS: Multimodal section and method signatures.