CalibraVAD

Voice Activity Detection (VAD) for identifying speech/singing in audio.

What is VAD?

Voice Activity Detection determines when someone is speaking or singing vs. when there's silence or background noise. Use it for:

  • Recording apps: Auto-start/stop recording when voice is detected

  • Transcription: Skip silent sections to save processing

  • Singing evaluation: Only score segments where the user is actually singing

  • Noise gate control: Mute audio when no voice is present

When to Use

| Scenario | Backend | Why |
|---|---|---|
| Simple voice detection | General | Fast, no model required |
| Accurate speech detection | Speech | Silero neural network, best for speech |
| Singing detection | Singing | YAMNet-based, distinguishes singing from speech |
| Low-latency singing | SingingRealtime | SwiftF0-based, minimal delay |

Quick Start

Kotlin

// Simple energy-based detection (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)
val ratio = vad.getVADRatio(samples, sampleRate = 48000) // 0.0 to 1.0
if (ratio > 0.5f) {
    println("Voice detected!")
}
vad.release()

Swift

// Simple energy-based detection (no model required)
let vad = CalibraVAD.create(modelProvider: .general())
let ratio = vad.getVADRatio(samples: samples, sampleRate: 48000)
if ratio > 0.5 {
    print("Voice detected!")
}
vad.release()
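The General backend is described as energy-based. As a rough illustration of what a frame-energy VAD ratio like `getVADRatio`'s can look like, here is a self-contained sketch; the 10 ms frame size and 0.01 RMS threshold are invented for the demo and are not CalibraVAD's actual values:

```kotlin
import kotlin.math.sin
import kotlin.math.sqrt

// Hypothetical energy-based VAD ratio: split audio into short frames and
// count the fraction whose RMS energy exceeds a threshold. Frame size and
// threshold are illustrative, not CalibraVAD's real values.
fun energyVadRatio(samples: FloatArray, frameSize: Int = 480, threshold: Float = 0.01f): Float {
    var voiced = 0
    var total = 0
    var i = 0
    while (i + frameSize <= samples.size) {
        var sumSq = 0f
        for (j in i until i + frameSize) sumSq += samples[j] * samples[j]
        if (sqrt(sumSq / frameSize) > threshold) voiced++
        total++
        i += frameSize
    }
    return if (total == 0) 0f else voiced.toFloat() / total
}

fun main() {
    val sr = 48000
    val silence = FloatArray(sr) // 1 s of silence
    val tone = FloatArray(sr) { (0.5 * sin(2 * Math.PI * 440 * it / sr)).toFloat() } // 1 s of 440 Hz
    println(energyVadRatio(silence + tone)) // 0.5: half the frames carry the tone
}
```

A ratio near 0.5 here matches the intuition behind the `ratio > 0.5f` check above: half the clip contains signal loud enough to count as voice.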

Usage Tiers

Tier 1: Simple Creation (80% of users)

Kotlin

// GENERAL backend (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)

// SPEECH backend (Silero model)
val vad = CalibraVAD.create(VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })

Swift

// GENERAL backend (no model required)
let vad = CalibraVAD.create(modelProvider: .general())

// SPEECH backend (Silero model)
let vad = CalibraVAD.create(
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)

Tier 2: Custom Config (15% of users)

Kotlin

val config = VADConfig.Builder()
    .preset(VADConfig.SPEECH)
    .threshold(0.4f)
    .build()
val vad = CalibraVAD.create(config, VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })

Swift

let config = VADConfig.Builder()
    .preset(.speech)
    .threshold(0.4)
    .build()
let vad = CalibraVAD.create(
    config: config,
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)

Streaming Mode

For real-time processing, use streaming mode:

Kotlin

recorder.audioBuffers.collect { buffer ->
    vad.acceptWaveform(buffer.toFloatArray(), sampleRate = 48000)
    if (vad.isVoiceDetected()) {
        showVoiceIndicator()
    }
}

Swift

for await buffer in recorder.audioBuffers {
    vad.acceptWaveform(samples: buffer.toFloatArray(), sampleRate: 48000)
    if vad.isVoiceDetected() {
        showVoiceIndicator()
    }
}
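Under the hood, a streaming detector has to buffer partial frames between calls and remember recent decisions. The following self-contained sketch shows that pattern with an energy-based stand-in; the frame size, threshold, and window length are invented here and do not reflect CalibraVAD's internals:

```kotlin
import kotlin.math.sqrt

// Toy streaming VAD: buffers incoming samples into fixed frames and keeps
// the last few per-frame decisions. All parameters are illustrative.
class StreamingEnergyVad(
    private val frameSize: Int = 480,
    private val threshold: Float = 0.01f,
    private val window: Int = 5,
) {
    private val pending = ArrayDeque<Float>()   // samples not yet forming a full frame
    private val recent = ArrayDeque<Boolean>()  // most recent frame decisions

    fun acceptWaveform(samples: FloatArray) {
        samples.forEach { pending.addLast(it) }
        while (pending.size >= frameSize) {
            var sumSq = 0f
            repeat(frameSize) { val s = pending.removeFirst(); sumSq += s * s }
            recent.addLast(sqrt(sumSq / frameSize) > threshold)
            while (recent.size > window) recent.removeFirst()
        }
    }

    // Voice is "detected" if any frame in the recent window was voiced.
    fun isVoiceDetected() = recent.any { it }

    // Clear all state when switching to a new audio source.
    fun reset() { pending.clear(); recent.clear() }
}
```

Note how `reset()` clears both the sample buffer and the decision window; that is the kind of state the "Not resetting between streams" pitfall below is about.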

Platform Notes

iOS

  • Audio from microphone is typically 48kHz; resampled internally to 16kHz

  • Neural network backends (Speech, Singing) require ONNX Runtime

  • General backend works without any external dependencies

Android

  • Audio from microphone varies by device (44.1kHz, 48kHz, 16kHz)

  • Neural network backends use ONNX Runtime for Android

  • General backend is pure Kotlin, no native dependencies
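The iOS note above mentions internal resampling from 48 kHz to 16 kHz. As a toy sketch of that step, here is 3:1 averaging decimation; a production resampler applies a proper low-pass filter first, so this is only to show the shape of the operation, not CalibraVAD's implementation:

```kotlin
// Toy 48 kHz -> 16 kHz downsampler: average every 3 input samples into 1.
// Illustrative only; real resamplers low-pass filter before decimating.
fun downsample48to16(samples: FloatArray): FloatArray {
    val out = FloatArray(samples.size / 3)
    for (i in out.indices) {
        val j = i * 3
        out[i] = (samples[j] + samples[j + 1] + samples[j + 2]) / 3f
    }
    return out
}

fun main() {
    val oneSecondAt48k = FloatArray(48000)
    println(downsample48to16(oneSecondAt48k).size) // 16000
}
```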

Common Pitfalls

  1. Forgetting to release: Call vad.release() to free native resources

  2. Using Speech backend for singing: Speech VAD is trained on speech, not singing

  3. Too sensitive threshold: Default thresholds work well; lower values = more false positives

  4. Not resetting between streams: Call vad.reset() when starting a new audio source
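Pitfall 3 can be seen numerically. Given some mocked per-frame voice probabilities (invented for this demo, not CalibraVAD output), lowering the threshold flags more borderline frames as voice:

```kotlin
// Count frames whose mocked voice probability clears the threshold.
fun voicedCount(probs: FloatArray, threshold: Float) = probs.count { it > threshold }

fun main() {
    val probs = floatArrayOf(0.05f, 0.2f, 0.35f, 0.6f, 0.9f) // mocked model outputs
    println(voicedCount(probs, 0.5f)) // 2: only the confident frames
    println(voicedCount(probs, 0.3f)) // 3: the borderline 0.35 frame now counts as voice
}
```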

See also

  • For pitch detection (what note, not just voice presence)

  • For live singing evaluation (uses VAD internally)

  • Configuration options for sensitivity tuning

  • Type-safe model provider selection

Types

object Companion

Functions

fun acceptWaveform(samples: FloatArray, sampleRate: Int = 16000)

Feed audio samples for streaming detection. Use with isVoiceDetected for real-time detection.

fun analyze(samples: FloatArray, sampleRate: Int = 16000): VADResult?

Analyze audio and return rich VAD result.

fun getVADRatio(samples: FloatArray, sampleRate: Int = 16000): Float

Get ratio of voiced frames in audio (0.0 to 1.0). Higher values indicate more speech/singing content.

fun isVoiceDetected(): Boolean

Check if voice is currently detected. Call after acceptWaveform for streaming mode.

fun release()

Release all resources. Must be called when done to free native resources.

fun reset()

Reset VAD state for new audio stream. Call when starting to process a new audio source.