Audio Classification (YAMNet)
Build on-device AI audio classification applications with ZETIC.MLange
What is YAMNet?
YAMNet is a deep neural network that predicts audio events from the AudioSet-YouTube corpus.
- Trained on the AudioSet dataset with 521 audio event classes
- Model on TensorFlow Hub: https://tfhub.dev/google/yamnet/1
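Across a clip, YAMNet produces one 521-way score vector per audio frame; averaging the frame scores and taking the highest-scoring classes yields clip-level labels. A minimal Kotlin sketch of that reduction (array sizes here are arbitrary; a real run maps the resulting indices to names from yamnet_class_map.csv):

```kotlin
// Average per-frame score vectors, then return the indices of the top-k classes.
// In a real run `frameScores` is shaped [numFrames][521].
fun topClasses(frameScores: List<FloatArray>, k: Int): List<Int> {
    val numClasses = frameScores[0].size
    val mean = FloatArray(numClasses)
    for (frame in frameScores)
        for (c in 0 until numClasses) mean[c] += frame[c] / frameScores.size
    return mean.indices.sortedByDescending { mean[it] }.take(k)
}
```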
Step-by-step Implementation
Prerequisites
Prepare the YAMNet model from TensorFlow Hub or Hugging Face and convert it to ONNX format.
Convert YAMNet to ONNX:
import tensorflow as tf
import tensorflow_hub as hub
import tf2onnx
import numpy as np
model = hub.load('https://tfhub.dev/google/yamnet/1')
concrete_func = model.signatures['serving_default']

tf.saved_model.save(
    model,
    "yamnet_saved_model",
    signatures=concrete_func
)
# Now use tf2onnx command line
# python -m tf2onnx.convert --saved-model yamnet_saved_model --output yamnet.onnx --opset 13

Prepare sample input:
import numpy as np
sample_rate = 16000
duration = 1 # 1 second
waveform = np.sin(2 * np.pi * 440 * np.linspace(0, duration, sample_rate))
waveform = waveform.astype(np.float32)
waveform = np.expand_dims(waveform, axis=0)
np.save('waveform.npy', waveform)

Generate ZETIC.MLange Model
If you want to generate your own model, you can upload the model and input with MLange Dashboard,
or use CLI:
zetic gen -p $PROJECT_NAME -i waveform.npy yamnet.onnx

Implement ZeticMLangeModel
We provide a model key for the demo app: yamnet. You can use this model key to try the ZETIC.MLange Application.
For detailed application setup, please follow the Deploy to Android Studio guide.
val yamnetModel = ZeticMLangeModel(this, "yamnet")
yamnetModel.run(inputs)
val outputs = yamnetModel.outputBuffers

For detailed application setup, please follow the Deploy to Xcode guide.
let yamnetModel = ZeticMLangeModel("yamnet")
yamnetModel.run(inputs)
let outputs = yamnetModel.getOutputDataArray()

Prepare Audio feature extractor
We provide an Audio Feature Extractor as an Android and iOS module.
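The extractor's internals aren't shown here, but conceptually the preprocess step converts captured PCM audio into the normalized float waveform YAMNet expects (mono, 16 kHz, values in [-1, 1]). A hypothetical Kotlin sketch of that conversion (`preprocessPcm16` is illustrative, not the module's actual API):

```kotlin
// Hypothetical sketch: normalize signed 16-bit PCM samples into the
// [-1, 1] float waveform YAMNet consumes. Resampling to 16 kHz and
// downmixing to mono would happen before this step.
fun preprocessPcm16(samples: ShortArray): FloatArray =
    FloatArray(samples.size) { i -> samples[i] / 32768.0f }
```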
// (1) Preprocess audio data and get processed float array
val inputs = preprocess(audioData)
// ... run model ...
// (2) Postprocess model outputs
val results = postprocess(outputs)

import ZeticMLange
// (1) Preprocess audio data and get processed float array
let inputs = preprocess(audioData)
// ... run model ...
// (2) Postprocess model outputs
let results = postprocess(&outputs)

Complete Audio Classification Implementation
// (0) Initialize model
val yamnetModel = ZeticMLangeModel(this, "yamnet")
// (1) Preprocess audio
val inputs = preprocess(audioData)
// (2) Run model
yamnetModel.run(inputs)
val outputs = yamnetModel.outputBuffers
// (3) Postprocess results
val predictions = postprocess(outputs)

// (0) Initialize model
let yamnetModel = ZeticMLangeModel("yamnet")
// (1) Preprocess audio
let inputs = preprocess(audioData)
// (2) Run model
yamnetModel.run(inputs)
let outputs = yamnetModel.getOutputDataArray()
// (3) Postprocess results
let predictions = postprocess(&outputs)

Conclusion
With ZETIC.MLange, implementing on-device audio classification with NPU acceleration is straightforward and efficient. YAMNet provides robust audio event detection capabilities across a wide range of categories. The simple pipeline of audio preprocessing and classification makes it easy to integrate into your applications.
We're continuously adding new models to our examples and Hugging Face page.
Stay tuned, and contact us for collaborations!