CalibraVAD

Voice Activity Detection (VAD) for identifying speech/singing in audio.

What is VAD?

Voice Activity Detection determines when someone is speaking or singing vs. when there's silence or background noise. Use it for:

  • Recording apps: Auto-start/stop recording when voice is detected

  • Transcription: Skip silent sections to save processing

  • Singing evaluation: Only score segments where the user is actually singing

  • Noise gate control: Mute audio when no voice is present

When to Use

| Scenario | Backend | Why |
|---|---|---|
| Simple voice detection | General | Fast, no model required |
| Accurate speech detection | Speech | Silero neural network, best for speech |
| Singing detection | Singing | YAMNet-based, distinguishes singing from speech |
| Low-latency singing | SingingRealtime | SwiftF0-based, minimal delay |

Quick Start

Kotlin

// Simple energy-based detection (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)
val ratio = vad.getVADRatio(samples, sampleRate = 48000) // 0.0 to 1.0
if (ratio > 0.5f) {
    println("Voice detected!")
}
vad.release()

Swift

// Simple energy-based detection (no model required)
let vad = CalibraVAD.create(modelProvider: .general())
let ratio = vad.getVADRatio(samples: samples, sampleRate: 48000)
if ratio > 0.5 {
    print("Voice detected!")
}
vad.release()
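The General backend is described as energy-based. As a rough illustration of what a frame-energy VAD ratio like `getVADRatio`'s can look like, here is a self-contained sketch; the 10 ms frame size and 0.01 RMS threshold are invented for the demo and are not CalibraVAD's actual values:

```kotlin
import kotlin.math.sin
import kotlin.math.sqrt

// Hypothetical energy-based VAD ratio: split audio into short frames and
// count the fraction whose RMS energy exceeds a threshold. Frame size and
// threshold are illustrative, not CalibraVAD's real values.
fun energyVadRatio(samples: FloatArray, frameSize: Int = 480, threshold: Float = 0.01f): Float {
    var voiced = 0
    var total = 0
    var i = 0
    while (i + frameSize <= samples.size) {
        var sumSq = 0f
        for (j in i until i + frameSize) sumSq += samples[j] * samples[j]
        if (sqrt(sumSq / frameSize) > threshold) voiced++
        total++
        i += frameSize
    }
    return if (total == 0) 0f else voiced.toFloat() / total
}

fun main() {
    val sr = 48000
    val silence = FloatArray(sr) // 1 s of silence
    val tone = FloatArray(sr) { (0.5 * sin(2 * Math.PI * 440 * it / sr)).toFloat() } // 1 s of 440 Hz
    println(energyVadRatio(silence + tone)) // 0.5: half the frames carry the tone
}
```

A ratio near 0.5 here matches the intuition behind the `ratio > 0.5f` check above: half the clip contains signal loud enough to count as voice.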

Usage Tiers

Tier 1: Simple Creation (80% of users)

Kotlin

// GENERAL backend (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)

// SPEECH backend (Silero model)
val vad = CalibraVAD.create(VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })

Swift

// GENERAL backend (no model required)
let vad = CalibraVAD.create(modelProvider: .general())

// SPEECH backend (Silero model)
let vad = CalibraVAD.create(
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)

Tier 2: Custom Config (15% of users)

Kotlin

val config = VADConfig.Builder()
    .preset(VADConfig.SPEECH)
    .threshold(0.4f)
    .build()
val vad = CalibraVAD.create(config, VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })

Swift

let config = VADConfig.Builder()
    .preset(.speech)
    .threshold(0.4)
    .build()
let vad = CalibraVAD.create(
    config: config,
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)

Streaming Mode

For real-time processing, use streaming mode:

Kotlin

recorder.audioBuffers.collect { buffer ->
    vad.acceptWaveform(buffer.toFloatArray(), sampleRate = 48000)
    if (vad.isVoiceDetected()) {
        showVoiceIndicator()
    }
}

Swift

for await buffer in recorder.audioBuffers {
    vad.acceptWaveform(samples: buffer.toFloatArray(), sampleRate: 48000)
    if vad.isVoiceDetected() {
        showVoiceIndicator()
    }
}
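Under the hood, a streaming detector has to buffer partial frames between calls and remember recent decisions. The following self-contained sketch shows that pattern with an energy-based stand-in; the frame size, threshold, and window length are invented here and do not reflect CalibraVAD's internals:

```kotlin
import kotlin.math.sqrt

// Toy streaming VAD: buffers incoming samples into fixed frames and keeps
// the last few per-frame decisions. All parameters are illustrative.
class StreamingEnergyVad(
    private val frameSize: Int = 480,
    private val threshold: Float = 0.01f,
    private val window: Int = 5,
) {
    private val pending = ArrayDeque<Float>()   // samples not yet forming a full frame
    private val recent = ArrayDeque<Boolean>()  // most recent frame decisions

    fun acceptWaveform(samples: FloatArray) {
        samples.forEach { pending.addLast(it) }
        while (pending.size >= frameSize) {
            var sumSq = 0f
            repeat(frameSize) { val s = pending.removeFirst(); sumSq += s * s }
            recent.addLast(sqrt(sumSq / frameSize) > threshold)
            while (recent.size > window) recent.removeFirst()
        }
    }

    // Voice is "detected" if any frame in the recent window was voiced.
    fun isVoiceDetected() = recent.any { it }

    // Clear all state when switching to a new audio source.
    fun reset() { pending.clear(); recent.clear() }
}
```

Note how `reset()` clears both the sample buffer and the decision window; that is the kind of state the "Not resetting between streams" pitfall below is about.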

Platform Notes

iOS

  • Audio from microphone is typically 48kHz; resampled internally to 16kHz

  • Neural network backends (Speech, Singing) require ONNX Runtime

  • General backend works without any external dependencies

Android

  • Audio from microphone varies by device (44.1kHz, 48kHz, 16kHz)

  • Neural network backends use ONNX Runtime for Android

  • General backend is pure Kotlin, no native dependencies
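The iOS note above mentions internal resampling from 48 kHz to 16 kHz. As a toy sketch of that step, here is 3:1 averaging decimation; a production resampler applies a proper low-pass filter first, so this is only to show the shape of the operation, not CalibraVAD's implementation:

```kotlin
// Toy 48 kHz -> 16 kHz downsampler: average every 3 input samples into 1.
// Illustrative only; real resamplers low-pass filter before decimating.
fun downsample48to16(samples: FloatArray): FloatArray {
    val out = FloatArray(samples.size / 3)
    for (i in out.indices) {
        val j = i * 3
        out[i] = (samples[j] + samples[j + 1] + samples[j + 2]) / 3f
    }
    return out
}

fun main() {
    val oneSecondAt48k = FloatArray(48000)
    println(downsample48to16(oneSecondAt48k).size) // 16000
}
```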

Common Pitfalls

  1. Forgetting to release: Call vad.release() to free native resources

  2. Using Speech backend for singing: Speech VAD is trained on speech, not singing

  3. Too sensitive threshold: Default thresholds work well; lower values = more false positives

  4. Not resetting between streams: Call vad.reset() when starting a new audio source
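Pitfall 3 can be seen numerically. Given some mocked per-frame voice probabilities (invented for this demo, not CalibraVAD output), lowering the threshold flags more borderline frames as voice:

```kotlin
// Count frames whose mocked voice probability clears the threshold.
fun voicedCount(probs: FloatArray, threshold: Float) = probs.count { it > threshold }

fun main() {
    val probs = floatArrayOf(0.05f, 0.2f, 0.35f, 0.6f, 0.9f) // mocked model outputs
    println(voicedCount(probs, 0.5f)) // 2: only the confident frames
    println(voicedCount(probs, 0.3f)) // 3: the borderline 0.35 frame now counts as voice
}
```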

See also

  • For pitch detection (what note, not just voice presence)

  • For live singing evaluation (uses VAD internally)

  • Configuration options for sensitivity tuning

  • Type-safe model provider selection

Types

object Companion

Functions

fun acceptWaveform(samples: FloatArray, sampleRate: Int = 16000)

Feed audio samples for streaming detection. Use with isVoiceDetected for real-time detection.

fun analyze(samples: FloatArray, sampleRate: Int = 16000): VADResult?

Analyze audio and return rich VAD result.

fun getVADRatio(samples: FloatArray, sampleRate: Int = 16000): Float

Get ratio of voiced frames in audio (0.0 to 1.0). Higher values indicate more speech/singing content.

fun isVoiceDetected(): Boolean

Check if voice is currently detected. Call after acceptWaveform for streaming mode.

fun release()

Release all resources. Must be called when done to free native resources.

fun reset()

Reset VAD state for new audio stream. Call when starting to process a new audio source.