Audio Classification (YAMNet)#

What is YAMNet#

  • YAMNet is a deep neural network that predicts audio events from the AudioSet-YouTube corpus

  • YAMNet was trained on the AudioSet dataset and outputs scores for 521 audio event classes (a minimal usage sketch follows below)

  • YAMNet on Hugging Face: link
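
To make the 521-class output concrete, here is a minimal sketch of running YAMNet directly from TensorFlow Hub in Python. The all-zero waveform is a placeholder; substitute your own 16 kHz mono float32 audio.

    import csv
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    
    model = hub.load('https://tfhub.dev/google/yamnet/1')
    
    # Placeholder input: 3 seconds of silence (16 kHz mono, float32 in [-1, 1])
    waveform = np.zeros(3 * 16000, dtype=np.float32)
    
    # YAMNet returns per-frame class scores (num_frames, 521), embeddings,
    # and a log-mel spectrogram
    scores, embeddings, spectrogram = model(waveform)
    
    # Resolve the top class name from the class map CSV bundled with the model
    class_map_path = model.class_map_path().numpy().decode('utf-8')
    with tf.io.gfile.GFile(class_map_path) as f:
        class_names = [row['display_name'] for row in csv.DictReader(f)]
    print(class_names[scores.numpy().mean(axis=0).argmax()])  # e.g. 'Silence'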

Step-by-step implementation#

0. Prerequisites#

Prepare the YAMNet model from TensorFlow Hub or Hugging Face.

  • YAMNet: Convert the TensorFlow model to ONNX format

    import tensorflow as tf
    import tensorflow_hub as hub
    
    # Load YAMNet from TensorFlow Hub and grab its serving signature
    model = hub.load('https://tfhub.dev/google/yamnet/1')
    concrete_func = model.signatures['serving_default']
    
    # Re-export as a SavedModel so tf2onnx can convert it
    tf.saved_model.save(
        model,
        "yamnet_saved_model",
        signatures=concrete_func
    )
    
    # Now use the tf2onnx command line
    # python -m tf2onnx.convert --saved-model yamnet_saved_model --output yamnet.onnx --opset 13
    
    
  • Prepare a sample input for MLange model generation:

    import numpy as np
    
    # 1-second 440 Hz sine tone in the format YAMNet expects: 16 kHz mono float32
    sample_rate = 16000
    duration = 1  # seconds
    waveform = np.sin(2 * np.pi * 440 * np.linspace(0, duration, sample_rate, endpoint=False))
    waveform = waveform.astype(np.float32)
    waveform = np.expand_dims(waveform, axis=0)  # add batch dimension -> (1, 16000)
    np.save('waveform.npy', waveform)
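
  • (Optional) Sanity-check the exported ONNX model against the saved sample input before generating the MLange model. This sketch assumes onnxruntime is installed and that the exported graph accepts the same (1, 16000) input; the input name and output order depend on the conversion.

    import numpy as np
    import onnxruntime as ort
    
    session = ort.InferenceSession('yamnet.onnx')
    waveform = np.load('waveform.npy')  # shape (1, 16000), float32
    
    # Feed the sample input under whatever name the exported graph uses
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: waveform})
    
    # One of the outputs holds the (num_frames, 521) class scores
    print([o.shape for o in outputs])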
    

1. Generate ZETIC.MLange model#

  • Get your own MLange model key

    # (1) Get mlange_gen
    $ wget https://github.com/zetic-ai/ZETIC_MLange_document/raw/main/bin/mlange_gen && chmod 755 mlange_gen
    
    # (2) Run mlange_gen for the YAMNet model
    $ ./mlange_gen -m yamnet.onnx -i waveform.npy
    
  • Expected output

    ...
    MLange Model Key : {YOUR_YAMNET_MODEL_KEY}
    ...
    

2. Implement ZeticMLangeModel with your model key#

  • We provide a public model key for the demo app: yamnet. You can use this model key to try the ZETIC.MLange application.

  • Android app

      // Initialize the model with your model key
      val yamnetModel = ZeticMLangeModel(this, "yamnet")
    
      // Run with preprocessed inputs, then read the output buffers
      yamnetModel.run(inputs)
    
      val outputs = yamnetModel.outputBuffers
    
  • iOS app

    • For detailed application setup, please follow deploy to Xcode

    • ZETIC.MLange usage in Swift

      // Initialize the model with your model key
      let yamnetModel = ZeticMLangeModel("yamnet")
    
      // Run with preprocessed inputs, then read the output data
      yamnetModel.run(inputs)
    
      let outputs = yamnetModel.getOutputDataArray()
    

3. Prepare Audio Feature Extractor for Android and iOS#

  • We provide an Audio Feature Extractor as an Android and iOS module (a Python sketch of its core logic follows the platform snippets below)

    • You can use your own feature extractor if you already have one for audio processing

  • For Android

    // (1) Preprocess audio data and get processed float array
    val inputs = preprocess(audioData)
    
    ...
    
    // (2) Postprocess model outputs
    val results = postprocess(outputs)
    
  • For iOS

    import ZeticMLange
    
    // (1) Preprocess audio data and get processed float array
    let inputs = preprocess(audioData)
    
    ...
    
    // (2) Postprocess model outputs
    let results = postprocess(&outputs)
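
  • The core of such an extractor, sketched in Python for clarity. The 16-bit PCM input, the class_names list, and the 521-way score reshaping are assumptions to adapt to your audio source:

    import numpy as np
    
    def preprocess(audio_data: np.ndarray) -> np.ndarray:
        """Convert 16-bit PCM samples to float32 in [-1, 1], as YAMNet expects."""
        waveform = audio_data.astype(np.float32) / 32768.0
        return np.expand_dims(waveform, axis=0)  # add batch dimension
    
    def postprocess(scores: np.ndarray, class_names: list[str]) -> str:
        """Average the frame-level scores and return the top class name."""
        mean_scores = scores.reshape(-1, 521).mean(axis=0)
        return class_names[int(mean_scores.argmax())]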
    

Complete Audio Classification Implementation#

  • For Android

    • Kotlin

    // (0) Initialize Model
    val yamnetModel = ZeticMLangeModel(this, "yamnet")
    
    // (1) Preprocess Audio
    val inputs = preprocess(audioData)
    
    // (2) Run Model
    yamnetModel.run(inputs)
    val outputs = yamnetModel.outputBuffers
    
    // (3) Postprocess Results
    val predictions = postprocess(outputs)
    
  • For iOS

    • Swift

    // (0) Initialize Model
    let yamnetModel = ZeticMLangeModel("yamnet")
    
    // (1) Preprocess Audio
    let inputs = preprocess(audioData)
    
    // (2) Run Model
    yamnetModel.run(inputs)
    let outputs = yamnetModel.getOutputDataArray()
    
    // (3) Postprocess Results
    let predictions = postprocess(&outputs)
    

Conclusion#

With ZETIC.MLange, implementing on-device audio classification with NPU acceleration is straightforward and efficient. YAMNet provides robust audio event detection across a wide range of categories, and the simple preprocess-run-postprocess pipeline makes it easy to integrate into your applications. We're continuously adding new models to our examples and Hugging Face page. Stay tuned, and contact us for collaborations!