Speech Recognition with Whisper#

What is Whisper?#

Whisper is a state-of-the-art speech recognition model developed by OpenAI that offers:

  • Multilingual support: Recognizes speech in multiple languages

  • Multiple capabilities: Performs speech recognition, language detection, and translation

  • Open source: Available through Hugging Face

Step-by-Step Implementation#

Prerequisites#

To prepare the model for deployment, we first convert the PyTorch model to TorchScript format using torch.jit.trace. Tracing requires sample inputs that match the model's expected tensor shapes. For this example, we'll use real audio data for the encoder input and dummy tensors for the decoder.

Prepare Sample Inputs for Tracing#

from datasets import load_dataset
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

model_name = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(
    model_name, sampling_rate=16_000
)

# Load sample dataset
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)

# Prepare encoder inputs
inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
)
input_features = inputs.input_features

# Prepare decoder inputs
MAX_TOKEN_LENGTH = model.config.max_target_positions
dummy_decoder_input_ids = torch.zeros((1, MAX_TOKEN_LENGTH), dtype=torch.long)
dummy_encoder_hidden_states = torch.randn(1, 1500, model.config.d_model).float()
dummy_decoder_attention_mask = torch.ones_like(dummy_decoder_input_ids)
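
Before tracing, it is worth confirming that these tensors have the shapes the whisper-tiny graph expects: (1, 80, 3000) log-Mel features for the encoder, a (1, 448) decoder token buffer, and (1, 1500, 384) encoder hidden states. A quick optional check, assuming the variables defined above are in scope:

# Optional sanity check of the tensor shapes used for tracing (whisper-tiny)
print(input_features.shape)                # torch.Size([1, 80, 3000])
print(dummy_decoder_input_ids.shape)       # torch.Size([1, 448])
print(dummy_encoder_hidden_states.shape)   # torch.Size([1, 1500, 384])
print(dummy_decoder_attention_mask.shape)  # torch.Size([1, 448])

assert dummy_decoder_input_ids.shape[-1] == model.config.max_target_positions
assert dummy_encoder_hidden_states.shape[-1] == model.config.d_model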

Convert PyTorch Model to TorchScript Format#

Encoder Conversion:

import torch.nn as nn

class WhisperEncoderWrapper(nn.Module):
    def __init__(self, whisper_model):
        super().__init__()
        self.enc = whisper_model.model.encoder

    def forward(self, input_features):
        return self.enc(input_features=input_features, return_dict=False)[0]

with torch.no_grad():
    encoder = WhisperEncoderWrapper(model).eval()
    traced_encoder = torch.jit.trace(encoder, input_features)
    traced_encoder.save("whisper_encoder.pt")

Decoder Conversion:

class WhisperDecoderWrapper(nn.Module):
    def __init__(self, whisper_model):
        super().__init__()
        self.decoder = whisper_model.model.decoder
        self.proj_out = whisper_model.proj_out

    def forward(self, input_ids, encoder_hidden_states, decoder_attention_mask):
        hidden = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=decoder_attention_mask,
            use_cache=False,
            return_dict=False,
        )[0]
        return self.proj_out(hidden)

with torch.no_grad():
    decoder = WhisperDecoderWrapper(model).eval()
    traced_decoder = torch.jit.trace(
        decoder, (dummy_decoder_input_ids, dummy_encoder_hidden_states, dummy_decoder_attention_mask)
    )
    traced_decoder.save("whisper_decoder.pt")
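
Because torch.jit.trace records a single execution path, it is worth comparing the traced modules against the eager Hugging Face modules before uploading them. A minimal optional check, assuming the tensors and traced modules from the previous steps are still in scope:

# Optional: compare traced outputs against the eager Hugging Face modules
with torch.no_grad():
    eager_enc = model.model.encoder(input_features=input_features).last_hidden_state
    traced_enc = traced_encoder(input_features)
    print(torch.allclose(eager_enc, traced_enc, atol=1e-4))  # expect True

    eager_dec = model.model.decoder(
        input_ids=dummy_decoder_input_ids,
        encoder_hidden_states=dummy_encoder_hidden_states,
        attention_mask=dummy_decoder_attention_mask,
        use_cache=False,
    ).last_hidden_state
    eager_logits = model.proj_out(eager_dec)
    traced_logits = traced_decoder(
        dummy_decoder_input_ids,
        dummy_encoder_hidden_states,
        dummy_decoder_attention_mask,
    )
    print(torch.allclose(eager_logits, traced_logits, atol=1e-4))  # expect True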

Save Input Samples in .npy Format#

import numpy as np
# Save encoder inputs
np.save("whisper_input_features.npy", input_features.cpu().numpy()) 

# Save decoder inputs
np.save(
    "whisper_decoder_input_ids.npy",
    dummy_decoder_input_ids.cpu().numpy().astype(np.int64),
)
np.save(
    "whisper_encoder_hidden_states.npy",
    dummy_encoder_hidden_states.cpu().numpy().astype(np.float32),
)
np.save(
    "whisper_decoder_attention_mask.npy",
    dummy_decoder_attention_mask.cpu().numpy().astype(np.int64),
)

1. Generate ZETIC.MLange model#

For more details about the zetic CLI, please refer to: https://docs.zetic.ai/steps/generate_model_key/generate-to-CLI.html

  • Get your own MLange model key

    # Upload your Whisper encoder and decoder models.
    # encoder
    zetic gen -p USER_NAME/MODEL_NAME -i whisper_input_features.npy whisper_encoder.pt
    ...
    # decoder 
    zetic gen -p USER_NAME/MODEL_NAME \
    -i whisper_decoder_input_ids.npy \
    -i whisper_encoder_hidden_states.npy \
    -i whisper_decoder_attention_mask.npy \
    whisper_decoder.pt
    
    • Expected output

    Uploading model from {MODEL_PATH} to project {USER_NAME/MODEL_NAME}
    Starting upload process...
    Project: {USER_NAME/MODEL_NAME}
    Model Path: {MODEL_PATH}
    request: FILES
    Upload completed successfully!
    Model is converting...
    Your model key is {YOUR_MODEL_KEY}
    Visit your model dashboard => https://mlange.zetic.ai/p/{USER_NAME}/{MODEL_NAME}/models/{YOUR_MODEL_KEY}
    

2. Implement ZeticMLangeModel with your model key#

  • We provide model keys for the demo app: OpenAI/whisper-tiny-encoder and OpenAI/whisper-tiny-decoder. You can use these keys to try the ZETIC.MLange application.

  • Android app

      val encoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
      val decoderModel = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")
    
      ...
    
      val outputs = encoderModel.run(inputs)
    
      ...
    
      decoderModel.run(..., ...)
    
  • iOS app

    • For detailed application setup, please follow the deploy to Xcode guide

    • ZETIC.MLange usage in Swift

      let encoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
      let decoderModel = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")
    
      ...
      
      let outputs = encoderModel.run(inputs)
    
      ...
    
      decoderModel.run(..., ...)
    

3. Whisper Model Implementation Structure#

The Whisper implementation consists of three main components:

  1. Feature Extractor: Processes raw audio into a Mel spectrogram

  2. Encoder: Processes the Mel spectrogram to generate audio embeddings

  3. Decoder: Generates text tokens from the audio embeddings

You can find WhisperDecoder and WhisperEncoder in the ZETIC.MLange example apps.
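
To make the data flow concrete, here is a minimal desktop sketch of the same three-stage pipeline in Python, driving the traced encoder and decoder from the earlier steps with a simple greedy loop (no KV cache, no forced language or task tokens) and using Hugging Face's WhisperTokenizer to turn token ids back into text. The on-device apps follow the same structure; this loop is only an illustration, not the production decoding logic:

import torch
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# 1. Feature extractor: raw audio -> log-Mel spectrogram (input_features from above)
# 2. Encoder: log-Mel spectrogram -> audio embeddings
encoder_hidden_states = traced_encoder(input_features)             # (1, 1500, 384)

# 3. Decoder: greedy token generation over the fixed-length buffer the
#    decoder was traced with; unused positions stay masked out.
max_len = model.config.max_target_positions                        # 448
decoder_input_ids = torch.zeros((1, max_len), dtype=torch.long)
attention_mask = torch.zeros((1, max_len), dtype=torch.long)
decoder_input_ids[0, 0] = model.config.decoder_start_token_id      # <|startoftranscript|>
attention_mask[0, 0] = 1

generated = [model.config.decoder_start_token_id]
with torch.no_grad():
    for step in range(1, max_len):
        logits = traced_decoder(decoder_input_ids, encoder_hidden_states, attention_mask)
        next_token = int(logits[0, step - 1].argmax())
        generated.append(next_token)
        if next_token == model.config.eos_token_id:
            break
        decoder_input_ids[0, step] = next_token
        attention_mask[0, step] = 1

print(tokenizer.decode(generated, skip_special_tokens=True))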

Complete Speech Recognition Implementation#

  • For Android (Kotlin)

    // Initialize components
    val whisper = WhisperFeatureWrapper()
    val encoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-encoder")
    val decoder = ZeticMLangeModel(this, "OpenAI/whisper-tiny-decoder")
    
    // Process audio
    val features = whisper.process(audioData)
    
    // Run encoder
    val outputs = encoder.process(features)
    
    // Generate tokens using decoder
    val generatedIds = decoder.generateTokens(outputs)
    
    // Convert tokens to text
    val text = whisper.decodeToken(generatedIds.toIntArray(), true)
    
  • For iOS (Swift)

    // Initialize components
    let wrapper = WhisperFeatureWrapper()
    let encoder = ZeticMLangeModel("OpenAI/whisper-tiny-encoder")
    let decoder = ZeticMLangeModel("OpenAI/whisper-tiny-decoder")
    
    // Process audio to features
    let features = wrapper.process(input.audio)
    
    // Run encoder
    let outputs = encoder.process(features)
    
    // Generate tokens using decoder
    let generatedIds = decoder.process(outputs)
    
    // Convert tokens to text
    let text = wrapper.decodeToken(generatedIds, true)
    return WhisperOutput(text: text)
    

Conclusion#

With ZETIC.MLange, implementing on-device speech recognition with NPU acceleration is straightforward and efficient. Whisper provides robust multilingual speech recognition and translation capabilities, and the simple pipeline of audio preprocessing, encoding, and decoding makes it easy to integrate into your applications. We're continuously adding new models to our examples and our Hugging Face page. Stay tuned, and contact us for collaborations!