Speech Recognition with Whisper
Build on-device AI speech recognition applications with ZETIC.MLange and OpenAI's Whisper
What is Whisper?
Whisper is a state-of-the-art speech recognition model developed by OpenAI that offers:
- Multilingual support: Recognizes speech in multiple languages
- Multiple capabilities: Performs speech recognition, language detection, and translation
- Open source: Available through Hugging Face
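If you want to try the stock model before converting anything, a minimal check is to run whisper-tiny through the Hugging Face pipeline API. This is plain PyTorch inference on the host machine, not the on-device path described below.
from datasets import load_dataset
from transformers import pipeline

# Plain PyTorch inference with the transformers pipeline (not on-device).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Transcribe a short English clip from the dummy LibriSpeech split used later in this guide.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
print(asr(ds[0]["audio"]["array"])["text"])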
Step-by-step Implementation
Prerequisites
To deploy the model, we first export the PyTorch model to TorchScript using torch.jit.trace. Tracing requires sample inputs that match the tensor shapes the model expects.
Prepare sample inputs for tracing
from datasets import load_dataset
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration
model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(
    model_name, sampling_rate=16_000
)
# Load sample dataset
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
# Prepare encoder inputs
inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")
input_features = inputs.input_features
# Prepare decoder inputs
MAX_TOKEN_LENGTH = model.config.max_target_positions
dummy_decoder_input_ids = torch.tensor([[0 for _ in range(MAX_TOKEN_LENGTH)]])
dummy_encoder_hidden_states = torch.randn(1, 1500, model.config.d_model).float()
dummy_decoder_attention_mask = torch.ones_like(dummy_decoder_input_ids)
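The traced graphs will be locked to these exact tensor shapes at runtime, so it is worth printing them once before tracing. The values noted below are what whisper-tiny should produce; larger checkpoints have a different d_model.
# Expected shapes for whisper-tiny:
#   input_features               [1, 80, 3000]   (log-Mel spectrogram of a 30 s window)
#   dummy_decoder_input_ids      [1, 448]
#   dummy_encoder_hidden_states  [1, 1500, 384]
#   dummy_decoder_attention_mask [1, 448]
print(input_features.shape)
print(dummy_decoder_input_ids.shape)
print(dummy_encoder_hidden_states.shape)
print(dummy_decoder_attention_mask.shape)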
Convert PyTorch model to TorchScript format
Encoder conversion:
from transformers import WhisperModel
import torch.nn as nn
class WhisperEncoderWrapper(nn.Module):
    """Wraps the Whisper encoder so tracing sees a single-tensor forward pass."""

    def __init__(self, whisper_model):
        super().__init__()
        self.enc = whisper_model.model.encoder

    def forward(self, input_features):
        return self.enc(input_features=input_features, return_dict=False)[0]

with torch.no_grad():
    encoder = WhisperEncoderWrapper(model).eval()
    traced_encoder = torch.jit.trace(encoder, input_features)
    traced_encoder.save("whisper_encoder.pt")
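As a quick check that tracing preserved the computation, you can compare the traced encoder against the eager encoder on the same sample. This step is optional but cheap.
with torch.no_grad():
    eager_out = model.model.encoder(input_features, return_dict=False)[0]
    traced_out = traced_encoder(input_features)
# The two outputs should agree up to numerical noise.
print(torch.allclose(eager_out, traced_out, atol=1e-4))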
Decoder conversion:
class WhisperDecoderWrapper(nn.Module):
    """Wraps the Whisper decoder plus the output projection that produces vocabulary logits."""

    def __init__(self, whisper_model):
        super().__init__()
        self.decoder = whisper_model.model.decoder
        self.proj_out = whisper_model.proj_out

    def forward(self, input_ids, encoder_hidden_states, decoder_attention_mask):
        hidden = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=decoder_attention_mask,
            use_cache=False,
            return_dict=False,
        )[0]
        return self.proj_out(hidden)
with torch.no_grad():
    decoder = WhisperDecoderWrapper(model).eval()
    traced_decoder = torch.jit.trace(
        decoder,
        (dummy_decoder_input_ids, dummy_encoder_hidden_states, dummy_decoder_attention_mask),
    )
    traced_decoder.save("whisper_decoder.pt")
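Because the traced decoder takes fixed-length inputs and no KV cache, generation fills the token buffer one position at a time. The loop below is a minimal greedy-decoding sketch against the traced modules, assuming the whisper-tiny tokenizer; it is only meant to sanity-check the exported pair and does not mirror the exact decoding logic used in the ZETIC.MLange apps.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(model_name)

with torch.no_grad():
    encoder_hidden = traced_encoder(input_features)

    # Fixed-shape buffers: fill token ids left to right and grow the attention mask with them.
    ids = torch.full((1, MAX_TOKEN_LENGTH), model.config.pad_token_id, dtype=torch.long)
    mask = torch.zeros_like(ids)
    ids[0, 0] = model.config.decoder_start_token_id  # <|startoftranscript|>
    mask[0, 0] = 1

    pos = 0
    for _ in range(100):  # cap generation length for this sanity check
        logits = traced_decoder(ids, encoder_hidden, mask)
        next_token = int(logits[0, pos].argmax())
        if next_token == model.config.eos_token_id or pos + 1 >= MAX_TOKEN_LENGTH:
            break
        pos += 1
        ids[0, pos] = next_token
        mask[0, pos] = 1

print(tokenizer.decode(ids[0, : pos + 1], skip_special_tokens=True))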
Save input samples in .npy format
import numpy as np
# Save encoder inputs
np.save("whisper_input_features.npy", input_features.cpu().numpy())
# Save decoder inputs
np.save(
    "whisper_decoder_input_ids.npy",
    dummy_decoder_input_ids.cpu().numpy().astype(np.int64),
)
np.save(
    "whisper_encoder_hidden_states.npy",
    dummy_encoder_hidden_states.cpu().numpy().astype(np.float32),
)
np.save(
    "whisper_decoder_attention_mask.npy",
    dummy_decoder_attention_mask.cpu().numpy().astype(np.int64),
)
Generate ZETIC.MLange model
If you want to generate your own models, you can upload the traced models and sample inputs through the MLange Dashboard, or use the CLI:
# Upload encoder model
zetic gen -p $PROJECT_NAME -i whisper_input_features.npy whisper_encoder.pt
# Upload decoder model
zetic gen -p $PROJECT_NAME \
-i whisper_decoder_input_ids.npy \
-i whisper_encoder_hidden_states.npy \
-i whisper_decoder_attention_mask.npy \
whisper_decoder.pt
Implement ZeticMLangeModel
We provide model keys for the demo app: OpenAI/whisper-tiny-encoder and OpenAI/whisper-tiny-decoder. You can use these model keys to try the ZETIC.MLange demo application.
For detailed application setup, please follow the Deploy to Android Studio guide.
val encoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
val decoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")
// Run encoder
val outputs = encoderModel.run(inputs)
// Run decoder
decoderModel.run(..., ...)
For detailed application setup, please follow the Deploy to Xcode guide.
let encoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
let decoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")
// Run encoder
let outputs = encoderModel.run(inputs)
// Run decoder
decoderModel.run(..., ...)
Whisper model implementation structure
The Whisper implementation consists of three main components:
- Feature Extractor: Converts raw audio into a Mel spectrogram
- Encoder: Processes the Mel spectrogram to generate audio embeddings
- Decoder: Generates text tokens from the audio embeddings
You can find the WhisperEncoder and WhisperDecoder implementations in the ZETIC.MLange apps.
Complete Speech Recognition Implementation
// Initialize components
val whisper = WhisperFeatureWrapper()
val encoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
val decoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")
// Process audio
val features = whisper.process(audioData)
// Run encoder
val outputs = encoder.process(features)
// Generate tokens using decoder
val generatedIds = decoder.generateTokens(outputs)
// Convert tokens to text
val text = whisper.decodeToken(generatedIds.toIntArray(), true)
// Initialize components
let wrapper = WhisperFeatureWrapper()
let encoder = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
let decoder = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")
// Process audio to features
let features = wrapper.process(input.audio)
// Run encoder
let outputs = encoder.process(features)
// Generate tokens using decoder
let generatedIds = decoder.process(outputs)
// Convert tokens to text
let text = wrapper.decodeToken(generatedIds, true)
return WhisperOutput(text: text)
Conclusion
With ZETIC.MLange, implementing on-device speech recognition with NPU acceleration is straightforward and efficient. Whisper provides robust multilingual speech recognition and translation capabilities. The simple pipeline of audio preprocessing and recognition makes it easy to integrate into your applications.
We're continuously adding new models to our examples and HuggingFace page.
Stay tuned, and contact us for collaborations!