# Calibra VAD
Voice Activity Detection (VAD) for identifying speech/singing in audio.
## What is VAD?
Voice Activity Detection determines when someone is speaking or singing vs. when there's silence or background noise. Use it for:
- **Recording apps**: Auto-start/stop recording when voice is detected
- **Transcription**: Skip silent sections to save processing
- **Singing evaluation**: Only score segments where the user is actually singing
- **Noise gate control**: Mute audio when no voice is present
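To make the voice-versus-silence decision concrete, here is a minimal, self-contained sketch of an energy-based voiced-frame ratio, the kind of computation a no-model backend performs. The function name, frame size, and threshold below are illustrative assumptions, not Calibra's actual internals.

```kotlin
import kotlin.math.sqrt

// Fraction of fixed-size frames whose RMS energy exceeds a threshold.
// frameSize = 480 is 10 ms at 48 kHz; both values are illustrative only.
fun vadRatio(samples: FloatArray, frameSize: Int = 480, threshold: Float = 0.02f): Float {
    if (samples.size < frameSize) return 0f
    val frameCount = samples.size / frameSize
    var voiced = 0
    for (f in 0 until frameCount) {
        var sumSq = 0.0
        for (i in f * frameSize until (f + 1) * frameSize) {
            sumSq += samples[i] * samples[i]
        }
        val rms = sqrt(sumSq / frameSize).toFloat()
        if (rms > threshold) voiced++   // frame counts as "voiced"
    }
    return voiced.toFloat() / frameCount
}
```

For example, a buffer that is half silence and half a loud constant tone scores a ratio of 0.5, matching the "ratio of voiced frames" semantics described below.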
## When to Use
| Scenario | Backend | Why |
|---|---|---|
| Simple voice detection | General | Fast, no model required |
| Accurate speech detection | Speech | Silero neural network, best for speech |
| Singing detection | Singing | YAMNet-based, distinguishes singing from speech |
| Low-latency singing | SingingRealtime | SwiftF0-based, minimal delay |
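The table above can be read as a simple decision rule. The sketch below mirrors it with a hypothetical sealed hierarchy and helper function; the real `VADModelProvider` type appears in the examples that follow and may be shaped differently.

```kotlin
// Hypothetical stand-in mirroring the four backends in the table above.
sealed class Backend {
    object General : Backend()          // energy-based, no model required
    object Speech : Backend()           // Silero neural network
    object Singing : Backend()          // YAMNet-based
    object SingingRealtime : Backend()  // SwiftF0-based, minimal delay
}

// Illustrative helper: pick a backend for a scenario, following the table.
fun backendFor(needsSinging: Boolean, lowLatency: Boolean, needsAccuracy: Boolean): Backend =
    when {
        needsSinging && lowLatency -> Backend.SingingRealtime
        needsSinging -> Backend.Singing
        needsAccuracy -> Backend.Speech
        else -> Backend.General
    }
```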
## Quick Start

**Kotlin**

```kotlin
// Simple energy-based detection (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)
val ratio = vad.getVADRatio(samples, sampleRate = 48000) // 0.0 to 1.0
if (ratio > 0.5f) {
    println("Voice detected!")
}
vad.release()
```

**Swift**

```swift
// Simple energy-based detection (no model required)
let vad = CalibraVAD.create(modelProvider: .general())
let ratio = vad.getVADRatio(samples: samples, sampleRate: 48000)
if ratio > 0.5 {
    print("Voice detected!")
}
vad.release()
```

## Usage Tiers

### Tier 1: Simple Creation (80% of users)

**Kotlin**

```kotlin
// GENERAL backend (no model required)
val generalVad = CalibraVAD.create(VADModelProvider.General)

// SPEECH backend (Silero model)
val speechVad = CalibraVAD.create(VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })
```

**Swift**

```swift
// GENERAL backend (no model required)
let generalVad = CalibraVAD.create(modelProvider: .general())

// SPEECH backend (Silero model)
let speechVad = CalibraVAD.create(
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)
```

### Tier 2: Custom Config (15% of users)
**Kotlin**

```kotlin
val config = VADConfig.Builder()
    .preset(VADConfig.SPEECH)
    .threshold(0.4f)
    .build()
val vad = CalibraVAD.create(config, VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })
```

**Swift**

```swift
let config = VADConfig.Builder()
    .preset(.speech)
    .threshold(0.4)
    .build()
let vad = CalibraVAD.create(
    config: config,
    modelProvider: .speech { ModelLoader.shared.loadSpeechVAD() }
)
```

## Streaming Mode
For real-time processing, use streaming mode:
**Kotlin**

```kotlin
recorder.audioBuffers.collect { buffer ->
    vad.acceptWaveform(buffer.toFloatArray(), sampleRate = 48000)
    if (vad.isVoiceDetected()) {
        showVoiceIndicator()
    }
}
```

**Swift**

```swift
for await buffer in recorder.audioBuffers {
    vad.acceptWaveform(samples: buffer.toFloatArray(), sampleRate: 48000)
    if vad.isVoiceDetected() {
        showVoiceIndicator()
    }
}
```

## Platform Notes
### iOS

- Microphone audio is typically 48 kHz; it is resampled internally to 16 kHz
- Neural network backends (Speech, Singing) require ONNX Runtime
- The General backend works without any external dependencies
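As a rough illustration of the 48 kHz to 16 kHz conversion mentioned above (a factor of exactly 3), here is a naive decimate-by-averaging sketch. This is an assumption for illustration only; the library's internal resampler presumably applies proper low-pass filtering before decimation.

```kotlin
// Illustrative 48 kHz -> 16 kHz downsampling: keep one output sample per
// three input samples, averaging each group of three as a crude
// anti-aliasing step. Not the library's actual resampler.
fun downsample48to16(input: FloatArray): FloatArray {
    val out = FloatArray(input.size / 3)
    for (i in out.indices) {
        out[i] = (input[3 * i] + input[3 * i + 1] + input[3 * i + 2]) / 3f
    }
    return out
}
```

One second of 48 kHz audio (48,000 samples) comes out as 16,000 samples, the rate the neural backends expect.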
### Android

- Microphone audio varies by device (44.1 kHz, 48 kHz, or 16 kHz)
- Neural network backends use ONNX Runtime for Android
- The General backend is pure Kotlin with no native dependencies
## Common Pitfalls
- **Forgetting to release**: Call `vad.release()` to free native resources
- **Using the Speech backend for singing**: Speech VAD is trained on speech, not singing
- **Too-sensitive threshold**: Default thresholds work well; lower values produce more false positives
- **Not resetting between streams**: Call `vad.reset()` when starting a new audio source
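The first pitfall can be made hard to hit with a small lifecycle helper. The interface and extension below are hypothetical stand-ins so the pattern compiles on its own; they are not part of the Calibra API.

```kotlin
// Hypothetical stand-in capturing the lifecycle-relevant calls.
interface ReleasableVad {
    fun reset()
    fun release()
    fun getVADRatio(samples: FloatArray, sampleRate: Int): Float
}

// Run a block against a VAD instance, guaranteeing release() even if the
// block throws, mirroring Kotlin's Closeable.use pattern.
fun <T> ReleasableVad.useVad(block: (ReleasableVad) -> T): T =
    try {
        block(this)
    } finally {
        release()  // free native resources no matter what
    }
```

Usage would look like `vad.useVad { v -> v.getVADRatio(samples, 48000) }`, with `reset()` called inside the block whenever a new audio source starts.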
## See also

- For pitch detection (what note, not just voice presence)
- For live singing evaluation (uses VAD internally)
- Configuration options for sensitivity tuning
- Type-safe model provider selection
## Functions

- Feed audio samples for streaming detection. Use with `isVoiceDetected` for real-time detection.
- Analyze audio and return a rich VAD result.
- Get the ratio of voiced frames in the audio (0.0 to 1.0). Higher values indicate more speech/singing content.
- Check whether voice is currently detected. Call after `acceptWaveform` in streaming mode.