Skip to main content

Voice Activity Detection

Voice Activity Detection (VAD) determines when someone is speaking or singing versus when there's silence or background noise.

What is VAD?

VAD answers a simple question: "Is there voice in this audio?"

Unlike pitch detection (which tells you what note is being sung), VAD tells you whether someone is singing at all.

Use Cases

ScenarioHow VAD Helps
Recording appAuto-start/stop recording when voice is detected
TranscriptionSkip silent sections to save processing
Singing evaluationOnly score segments where the user is singing
Noise gateMute output when no voice is present
Voice commandsDetect when user starts speaking

How It Works

VAD analyzes audio and returns:

  1. Voice Detected (boolean): Is voice present right now?
  2. VAD Ratio (0.0 to 1.0): What percentage of the audio contains voice?

Backends

VoxaTrace offers four VAD backends optimized for different scenarios:

General (Energy-based)

Simple and fast. Detects when audio exceeds an energy threshold.

AspectDetails
SpeedVery fast
AccuracyBasic
DependenciesNone
Best ForSimple voice detection, low-resource devices
val vad = CalibraVAD.create(VADModelProvider.General)

Speech (Silero VAD)

Neural network trained on speech. Very accurate for spoken content.

AspectDetails
SpeedFast
AccuracyExcellent for speech
DependenciesONNX Runtime
Best ForVoice assistants, transcription
val vad = CalibraVAD.create(
VADModelProvider.Speech { ModelLoader.loadSpeechVAD() }
)

Singing Realtime (SwiftF0-based)

Uses pitch detection confidence for low-latency singing detection.

AspectDetails
SpeedVery fast
AccuracyGood for singing
DependenciesONNX Runtime
Best ForReal-time singing apps, low latency
val vad = CalibraVAD.create(
VADModelProvider.SingingRealtime { ModelLoader.loadSingingRealtimeVAD() }
)

API Modes

Batch Mode

Analyze a complete audio segment:

val vad = CalibraVAD.create(VADModelProvider.General)

// Get ratio of voiced frames (0.0 = all silence, 1.0 = all voice)
val ratio = vad.getVADRatio(samples, sampleRate = 48000)

// Get rich result with level classification
val result = vad.analyze(samples, sampleRate = 48000)
println("Ratio: ${result?.ratio}, Level: ${result?.level}")

vad.release()

Streaming Mode

For real-time processing, feed audio continuously:

val vad = CalibraVAD.create(VADModelProvider.Speech { ... })

recorder.audioBuffers.collect { buffer ->
vad.acceptWaveform(buffer.toFloatArray(), buffer.sampleRate)

if (vad.isVoiceDetected()) {
showVoiceIndicator()
} else {
hideVoiceIndicator()
}
}

vad.release()

VAD Result

The analyze() method returns a VADResult:

data class VADResult(
val ratio: Float, // 0.0 to 1.0
val level: VoiceActivityLevel, // NONE, PARTIAL, FULL
val isVoiceDetected: Boolean, // ratio > 0.3
val isFullActivity: Boolean // ratio > 0.7
)

Use the level for UI feedback:

when (result.level) {
VoiceActivityLevel.NONE -> showSilenceState()
VoiceActivityLevel.PARTIAL -> showWeakVoice()
VoiceActivityLevel.FULL -> showStrongVoice()
}

Threshold Tuning

Adjust sensitivity with VADConfig:

val config = VADConfig.Builder()
.preset(VADConfig.SPEECH)
.threshold(0.3f) // Lower = more sensitive (more false positives)
.build()

val vad = CalibraVAD.create(config, VADModelProvider.Speech { ... })
ThresholdBehavior
0.2Very sensitive, catches faint voice
0.4Balanced (default for Speech)
0.6Strict, requires clear voice
0.8Very strict, only loud/clear voice

Common Patterns

Auto-Record When Voice Detected

var isRecording = false

recorder.audioBuffers.collect { buffer ->
vad.acceptWaveform(buffer.toFloatArray(), buffer.sampleRate)

if (vad.isVoiceDetected() && !isRecording) {
isRecording = true
startRecording()
} else if (!vad.isVoiceDetected() && isRecording) {
// Add debounce to avoid cutting off between words
isRecording = false
stopRecording()
}
}

Skip Silent Sections for Analysis

val contour = pitchExtractor.extract(audio, sampleRate)
val vadRatio = vad.getVADRatio(audio, sampleRate)

if (vadRatio < 0.1f) {
// Mostly silence, skip scoring
return SkipResult("No voice detected")
}

// Proceed with evaluation
val score = evaluator.evaluate(contour)

Combine with Pitch Detection

recorder.audioBuffers.collect { buffer ->
val samples = buffer.toFloatArray()

// Quick VAD check first (cheap)
vad.acceptWaveform(samples, buffer.sampleRate)
if (!vad.isVoiceDetected()) {
updateUI(pitch = null) // Show silence
return@collect
}

// Only run pitch detection if voice detected (expensive)
val point = detector.detect(samples, buffer.sampleRate)
updateUI(pitch = point)
}

Performance Considerations

BackendLatencyMemory
General< 1msMinimal
Speech~5ms~10 MB (model)
Singing~20ms~50 MB (model)
SingingRealtime~5ms~10 MB (model)

For battery-sensitive apps, use General backend and only load neural models when needed.

Next Steps