Speech Recognition with Whisper#
What is Whisper?#
Whisper is a state-of-the-art speech recognition model developed by OpenAI that offers:
Multilingual support: Recognizes speech in multiple languages
Multiple capabilities: Performs speech recognition, language detection, and translation
Open source: Available through Hugging Face, as shown in the quick sketch below
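Before moving to the on-device workflow, Whisper can be tried directly with the Hugging Face transformers pipeline. The snippet below is a minimal sketch: the audio path is a placeholder, and decoding audio files this way requires ffmpeg to be installed.
from transformers import pipeline

# Load whisper-tiny as an automatic-speech-recognition pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# "sample.wav" is a placeholder path; pass any speech recording
print(asr("sample.wav")["text"])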
Step-by-Step Implementation#
Prerequisites#
To prepare the model for deployment, we convert the PyTorch model to TorchScript format using torch.jit.trace. This process requires input samples that match the model's expected tensor shapes. For this example, we'll prepare real audio data for the encoder input and dummy tensors for the decoder.
Prepare Sample Inputs for Tracing#
from datasets import load_dataset
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(
    model_name, sampling_rate=16_000
)

# Load sample dataset
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)

# Prepare encoder inputs: raw audio -> log-Mel spectrogram features
inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
)
input_features = inputs.input_features

# Prepare decoder inputs (dummy tensors with the expected shapes)
MAX_TOKEN_LENGTH = model.config.max_target_positions
dummy_decoder_input_ids = torch.tensor([[0 for _ in range(MAX_TOKEN_LENGTH)]])
dummy_encoder_hidden_states = torch.randn(1, 1500, model.config.d_model).float()
dummy_decoder_attention_mask = torch.ones_like(dummy_decoder_input_ids)
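Before tracing, it can help to confirm that the prepared tensors have the shapes the traced modules will expect. The quick check below is a sketch; the sizes noted in the comments are what whisper-tiny should produce.
# For whisper-tiny these should print roughly:
# input_features:        torch.Size([1, 80, 3000])
# decoder_input_ids:     torch.Size([1, 448])
# encoder_hidden_states: torch.Size([1, 1500, 384])
print("input_features:", input_features.shape)
print("decoder_input_ids:", dummy_decoder_input_ids.shape)
print("encoder_hidden_states:", dummy_encoder_hidden_states.shape)
print("decoder_attention_mask:", dummy_decoder_attention_mask.shape)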
Convert PyTorch Model to TorchScript Format#
Encoder Conversion:
import torch.nn as nn

class WhisperEncoderWrapper(nn.Module):
    def __init__(self, whisper_model):
        super().__init__()
        self.enc = whisper_model.model.encoder

    def forward(self, input_features):
        # Return only the last hidden state (the audio embeddings)
        return self.enc(input_features=input_features, return_dict=False)[0]

with torch.no_grad():
    encoder = WhisperEncoderWrapper(model).eval()
    traced_encoder = torch.jit.trace(encoder, input_features)
    traced_encoder.save("whisper_encoder.pt")
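Optionally, the traced encoder can be compared against the eager wrapper defined above to confirm the trace is faithful. A minimal sketch, reusing the tensors prepared earlier:
with torch.no_grad():
    eager_out = encoder(input_features)          # eager wrapper from above
    traced_out = traced_encoder(input_features)  # traced module
# The traced graph should reproduce the eager output up to numerical noise
print(torch.allclose(eager_out, traced_out, atol=1e-4))  # expect True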
Decoder Conversion:
class WhisperDecoderWrapper(nn.Module):
    def __init__(self, whisper_model):
        super().__init__()
        self.decoder = whisper_model.model.decoder
        self.proj_out = whisper_model.proj_out

    def forward(self, input_ids, encoder_hidden_states, decoder_attention_mask):
        hidden = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=decoder_attention_mask,
            use_cache=False,
            return_dict=False,
        )[0]
        # Project decoder hidden states to vocabulary logits
        return self.proj_out(hidden)

with torch.no_grad():
    decoder = WhisperDecoderWrapper(model).eval()
    traced_decoder = torch.jit.trace(
        decoder,
        (dummy_decoder_input_ids, dummy_encoder_hidden_states, dummy_decoder_attention_mask),
    )
    traced_decoder.save("whisper_decoder.pt")
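It is also worth running the traced decoder once with the real encoder output to confirm the wiring: the result should be one row of vocabulary logits per decoder position. A quick sketch; the vocabulary size noted in the comment is what whisper-tiny should report.
with torch.no_grad():
    encoder_hidden_states = traced_encoder(input_features)
    logits = traced_decoder(
        dummy_decoder_input_ids, encoder_hidden_states, dummy_decoder_attention_mask
    )
# One row of vocabulary logits per decoder position, e.g. (1, 448, 51865) for whisper-tiny
print(logits.shape)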
Save Input Samples in .npy Format#
import numpy as np

# Save encoder inputs
np.save("whisper_input_features.npy", input_features.cpu().numpy())

# Save decoder inputs
np.save(
    "whisper_decoder_input_ids.npy",
    dummy_decoder_input_ids.cpu().numpy().astype(np.int64),
)
np.save(
    "whisper_encoder_hidden_states.npy",
    dummy_encoder_hidden_states.cpu().numpy().astype(np.float32),
)
np.save(
    "whisper_decoder_attention_mask.npy",
    dummy_decoder_attention_mask.cpu().numpy().astype(np.int64),
)
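Since these files are passed to the zetic CLI as input samples, it is worth a quick check that they reload with the shapes and dtypes saved above (a small sketch):
# Reload and verify the saved samples before uploading them with the zetic CLI
for name in [
    "whisper_input_features.npy",
    "whisper_decoder_input_ids.npy",
    "whisper_encoder_hidden_states.npy",
    "whisper_decoder_attention_mask.npy",
]:
    arr = np.load(name)
    print(name, arr.shape, arr.dtype)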
1. Generate ZETIC.MLange model#
For more details about the zetic CLI, please refer to: https://docs.zetic.ai/steps/generate_model_key/generate-to-CLI.html
Get your own MLange model key
# Upload your whisper encoder and decoder model.
# encoder
zetic gen -p USER_NAME/MODEL_NAME -i whisper_input_features.npy whisper_encoder.pt
...
# decoder
zetic gen -p USER_NAME/MODEL_NAME \
    -i whisper_decoder_input_ids.npy \
    -i whisper_encoder_hidden_states.npy \
    -i whisper_decoder_attention_mask.npy \
    whisper_decoder.pt
Expected output
Uploading model from {MODEL_PATH} to project {USER_NAME/MODEL_NAME}
Starting upload process...
  Project: {USER_NAME/MODEL_NAME}
  Model Path: {MODEL_PATH}
request: FILES
Upload completed successfully!
Model is converting...
Your model key is {YOUR_MODEL_KEY}
Visit your model dashboard => https://mlange.zetic.ai/p/{USER_NAME}/{MODEL_NAME}/models/{YOUR_MODEL_KEY}
2. Implement ZeticMLangeModel with your model key#
We provide model keys for the demo app: OpenAI/whisper-tiny-encoder and OpenAI/whisper-tiny-decoder. You can use these model keys to try the ZETIC.MLange application.
Android app
For detailed application setup, please follow deploy to Android Studio.
ZETIC.MLange usage in Kotlin
val encoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
val decoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")
...
val outputs = encoderModel.run(inputs)
...
decoderModel.run(..., ...)
iOS app
For detailed application setup, please follow deploy to XCode.
ZETIC.MLange usage in Swift
let encoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
let decoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")
...
let outputs = encoderModel.run(inputs)
...
decoderModel.run(..., ...)
3. Whisper Model Implementation Structure#
The Whisper implementation consists of three main components:
Feature Extractor: Processes raw audio into a Mel spectrogram
Encoder: Processes the Mel spectrogram to generate audio embeddings
Decoder: Generates text tokens from the audio embeddings
You can find WhisperEncoder and WhisperDecoder in the ZETIC.MLange apps.
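The same three-stage flow can be exercised on the desktop with the traced modules before wiring up the mobile apps. The greedy loop below is a simplified sketch: it reuses input_features, MAX_TOKEN_LENGTH, traced_encoder, and traced_decoder from the earlier steps, assumes the whisper-tiny tokenizer with the standard English transcription prompt, and skips timestamp handling.
import torch
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# 1. Feature extractor: raw audio -> Mel spectrogram (`input_features` from earlier)
# 2. Encoder: Mel spectrogram -> audio embeddings
encoder_hidden_states = traced_encoder(input_features)

# 3. Decoder: greedily generate tokens from the audio embeddings,
#    starting from the English transcription prompt without timestamps
tokens = tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
)
for _ in range(64):  # small demo budget; stop earlier at <|endoftext|>
    ids = torch.zeros((1, MAX_TOKEN_LENGTH), dtype=torch.long)  # pad with 0, as in the dummy input
    ids[0, : len(tokens)] = torch.tensor(tokens)
    mask = torch.zeros_like(ids)
    mask[0, : len(tokens)] = 1
    logits = traced_decoder(ids, encoder_hidden_states, mask)
    next_token = int(logits[0, len(tokens) - 1].argmax())
    if next_token == tokenizer.eos_token_id:
        break
    tokens.append(next_token)

print(tokenizer.decode(tokens, skip_special_tokens=True))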
Complete Speech Recognition Implementation#
For Android (Kotlin)
// Initialize components
val whisper = WhisperFeatureWrapper()
val encoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
val decoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")

// Process audio
val features = whisper.process(audioData)

// Run encoder
val outputs = encoder.process(features)

// Generate tokens using decoder
val generatedIds = decoder.generateTokens(outputs)

// Convert tokens to text
val text = whisper.decodeToken(generatedIds.toIntArray(), true)
For iOS (Swift)
// Initialize components
let wrapper = WhisperFeatureWrapper()
let encoder = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
let decoder = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")

// Process audio to features
let features = wrapper.process(input.audio)

// Run encoder
let outputs = encoder.process(features)

// Generate tokens using decoder
let generatedIds = decoder.process(outputs)

// Convert tokens to text
let text = wrapper.decodeToken(generatedIds, true)
return WhisperOutput(text: text)
Conclusion#
With ZETIC.MLange, implementing on-device speech recognition with NPU acceleration is straightforward and efficient. Whisper provides robust multilingual speech recognition and translation capabilities. The simple pipeline of audio preprocessing and recognition makes it easy to integrate into your applications. We’re continuously adding new models to our examples and HuggingFace page. Stay tuned, and contact us for collaborations!