
CalibraVAD

Voice Activity Detection (VAD) for identifying speech and singing in audio. It determines when someone is speaking or singing versus when there is only silence or background noise.

Quick Start

Kotlin

val vad = CalibraVAD.create(VADModelProvider.General)
val ratio = vad.getVADRatio(samples, sampleRate = 48000)
if (ratio > 0.5f) {
    println("Voice detected!")
}
vad.release()

Swift

let vad = CalibraVAD.create(.general)
let ratio = vad.getVADRatio(samples: samples, sampleRate: 48000)
if ratio > 0.5 {
    print("Voice detected!")
}
vad.release()

Backends

| Backend | Kotlin Provider | Swift Provider | Best For | Dependencies |
|---|---|---|---|---|
| General | `VADModelProvider.General` | `.general` | Simple detection, low power | None |
| Speech | `VADModelProvider.speech()` | `.speech()` | Speech detection | Silero ONNX |
| SingingRealtime | `VADModelProvider.singingRealtime()` | `.singingRealtime()` | Low-latency singing | SwiftF0 ONNX |
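
The General backend's "RMS-based heuristics" amount to thresholding per-window signal energy. CalibraVAD's actual implementation isn't shown here; the sketch below is only an illustration of the idea, reusing the documented `windowSize` (512) and `rmsThreshold` (0.05) defaults:

```kotlin
import kotlin.math.sqrt

// Illustrative only: an RMS-energy heuristic similar in spirit to the
// General backend. Returns the fraction of windows whose RMS energy
// exceeds the threshold.
fun rmsVoicedRatio(
    samples: FloatArray,
    windowSize: Int = 512,
    rmsThreshold: Float = 0.05f
): Float {
    if (samples.size < windowSize) return 0f
    var voiced = 0
    var windows = 0
    for (start in 0..(samples.size - windowSize) step windowSize) {
        var sumSq = 0.0
        for (i in start until start + windowSize) {
            sumSq += samples[i] * samples[i]
        }
        val rms = sqrt(sumSq / windowSize) // root-mean-square of this window
        if (rms > rmsThreshold) voiced++
        windows++
    }
    return voiced.toFloat() / windows
}
```

Because it needs no model, this style of detection is cheap enough to run continuously on battery-powered devices, which is why the table lists it under "low power".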

Creating a VAD

Backend and defaults are inferred from the provider type:

Kotlin

// General (no model required)
val vad = CalibraVAD.create(VADModelProvider.General)

// Speech (Silero model)
val vad = CalibraVAD.create(VADModelProvider.speech())

// Singing realtime (SwiftF0 model)
val vad = CalibraVAD.create(VADModelProvider.singingRealtime())

Swift

// General (no model required)
let vad = CalibraVAD.create(.general)

// Speech (Silero model)
let vad = CalibraVAD.create(.speech())

// Singing realtime (SwiftF0 model)
let vad = CalibraVAD.create(.singingRealtime())

With Custom Model Provider

Kotlin

// Custom model loader
val vad = CalibraVAD.create(VADModelProvider.Speech { ModelLoader.loadSpeechVAD() })
val vad = CalibraVAD.create(VADModelProvider.SingingRealtime { ModelLoader.loadSingingRealtimeVAD() })

Swift

// Custom model loader
let vad = CalibraVAD.create(.speech { ModelLoader.loadSpeechVAD() })
let vad = CalibraVAD.create(.singingRealtime { ModelLoader.loadSingingRealtimeVAD() })

With Config + Model Provider

Kotlin

val config = VADConfig.Builder()
    .preset(VADConfig.SPEECH)
    .threshold(0.4f)
    .build()

val vad = CalibraVAD.create(config, VADModelProvider.speech())

Swift

let config = VADConfig.Builder()
    .preset(.speech)
    .threshold(0.4)
    .build()

let vad = CalibraVAD.create(config: config, modelProvider: .speech())

Configuration

Presets

| Preset | Kotlin | Swift | Description |
|---|---|---|---|
| Speech | `VADConfig.SPEECH` | `.speech` | Silero neural network |
| General | `VADConfig.GENERAL` | `.general` | RMS-based heuristics |
| SingingRealtime | `VADConfig.SINGING_REALTIME` | `.singingRealtime` | SwiftF0 pitch-based |

Config Properties

| Property | Type | Default | Description |
|---|---|---|---|
| backend | VADBackend | SPEECH | Detection engine |
| sampleRate | Int | 16000 | Audio sample rate in Hz |
| threshold | Float | 0.5 | Detection threshold (0.0–1.0) |
| minSpeechDuration | Float | 0.25 | Minimum speech duration in seconds |
| minSilenceDuration | Float | 0.25 | Minimum silence duration in seconds |
| windowSize | Int | 512 | Processing window size in samples |
| numThreads | Int | 1 | Number of inference threads |
| modelPath | String? | null | Custom model path (null = bundled model) |
| rmsThreshold | Float | 0.05 | RMS threshold for GENERAL backend |
| pitchProbThreshold | Float | 0.5 | Pitch probability threshold for GENERAL backend |
| minPitch | Float | 50 | Minimum pitch in Hz for GENERAL backend |

Builder Methods

| Method | Description |
|---|---|
| preset(config) | Start from a preset configuration |
| backend(backend) | Set backend with its default config |
| sampleRate(rate) | Set audio sample rate in Hz |
| threshold(value) | Set detection threshold (0.0–1.0) |
| minSpeechDuration(seconds) | Minimum speech duration |
| minSilenceDuration(seconds) | Minimum silence duration |
| windowSize(samples) | Processing window size |
| numThreads(threads) | Number of inference threads |
| modelPath(path) | Custom model path (null = bundled) |
| rmsThreshold(value) | RMS threshold for GENERAL backend |
| pitchProbThreshold(value) | Pitch probability threshold for GENERAL backend |
| minPitch(hz) | Minimum pitch in Hz for GENERAL backend |
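
Several builder methods can be chained to tune a preset for a specific environment. The values below are illustrative, not recommendations, and the pairing of the config overload of `create` with `VADModelProvider.General` is an assumption based on the two-argument form shown above:

```kotlin
// Start from the General preset, then tighten the heuristics for a
// noisier environment (example values only).
val config = VADConfig.Builder()
    .preset(VADConfig.GENERAL)
    .sampleRate(16000)
    .rmsThreshold(0.08f)     // require more energy before a frame counts as voiced
    .minPitch(80f)           // ignore low-frequency rumble below typical voices
    .minSpeechDuration(0.3f) // suppress very short energy bursts
    .build()

val vad = CalibraVAD.create(config, VADModelProvider.General)
```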

One-Shot Analysis

getVADRatio

Returns the ratio of voiced frames (0.0 to 1.0). Accepts any sample rate and resamples internally to 16kHz.

Kotlin

val ratio = vad.getVADRatio(samples, sampleRate = 48000)
// 0.0 = silence, 1.0 = continuous voice

Swift

let ratio = vad.getVADRatio(samples: samples, sampleRate: 48000)
// 0.0 = silence, 1.0 = continuous voice

analyze

Returns a rich VADResult with ratio, level, and convenience properties:

val result = vad.analyze(samples, sampleRate = 48000)
if (result != null) {
    println("Ratio: ${result.ratio}")
    println("Level: ${result.level}") // NONE, PARTIAL, or FULL
    println("Voice detected: ${result.isVoiceDetected}")
}

VADResult

| Property | Type | Description |
|---|---|---|
| ratio | Float | Voice activity ratio (0.0–1.0) |
| level | VoiceActivityLevel | NONE, PARTIAL, or FULL |
| isVoiceDetected | Boolean | True if ratio > 0.5 |
| isFullActivity | Boolean | True if level is FULL |
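
The three-way level classification lends itself to a `when` expression. A sketch (the UI functions are hypothetical placeholders, and the exact ratio cutoffs behind NONE/PARTIAL/FULL are not documented here):

```kotlin
// Branch on the VoiceActivityLevel from analyze(); analyze may return
// null, so the null case is handled alongside NONE.
val result = vad.analyze(samples, sampleRate = 48000)
when (result?.level) {
    VoiceActivityLevel.FULL -> showRecordingIndicator()    // sustained voice
    VoiceActivityLevel.PARTIAL -> showFaintIndicator()     // intermittent voice
    VoiceActivityLevel.NONE, null -> hideIndicator()       // silence or no result
}
```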

Streaming Mode

For real-time processing, use acceptWaveform + isVoiceDetected:

Kotlin

recorder.audioBuffers.collect { buffer ->
    vad.acceptWaveform(buffer.toFloatArray(), sampleRate = 48000)
    if (vad.isVoiceDetected()) {
        showVoiceIndicator()
    }
}

Swift

for await buffer in recorder.audioBuffers {
    vad.acceptWaveform(samples: buffer.toFloatArray(), sampleRate: 48000)
    if vad.isVoiceDetected() {
        showVoiceIndicator()
    }
}

Swift Convenience Methods

// Quick check with custom threshold
let hasVoice = vad.hasVoiceActivity(samples: samples, sampleRate: 48000, threshold: 0.3)

// Classify activity level
let level = vad.classifyVoiceActivity(samples: samples, sampleRate: 48000)
// .none, .partial, or .full

Methods

| Method | Description |
|---|---|
| getVADRatio(samples, sampleRate) | Get voiced frame ratio (0.0–1.0) |
| analyze(samples, sampleRate) | Get rich VADResult with level classification |
| acceptWaveform(samples, sampleRate) | Feed samples for streaming detection |
| isVoiceDetected() | Check if voice is currently detected (streaming mode) |
| reset() | Reset state for new audio stream |
| release() | Release all resources |
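
reset() matters when one instance is reused across separate recordings: without it, streaming state from the previous stream could carry over into the next. A sketch of the reuse lifecycle (`processRecording` is an illustrative helper, not part of the API):

```kotlin
val vad = CalibraVAD.create(VADModelProvider.speech())

fun processRecording(buffers: List<FloatArray>, sampleRate: Int) {
    vad.reset() // clear streaming state left over from the previous recording
    for (buffer in buffers) {
        vad.acceptWaveform(buffer, sampleRate = sampleRate)
        if (vad.isVoiceDetected()) {
            // react to live voice activity for this recording
        }
    }
}

// When the VAD is no longer needed anywhere:
vad.release()
```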

Common Patterns

VAD with Recording

class VoiceDetectionViewModel : ViewModel() {
    private var vad: CalibraVAD? = null

    val isVoiceActive = MutableStateFlow(false)
    val voiceLevel = MutableStateFlow(0f)

    fun startListening() {
        // Keep a non-null local reference so the collect block
        // doesn't need unsafe !! access.
        val vad = CalibraVAD.create(VADModelProvider.speech()).also { this.vad = it }

        viewModelScope.launch {
            recorder.audioBuffers.collect { buffer ->
                val ratio = vad.getVADRatio(buffer.samples, buffer.sampleRate)
                voiceLevel.value = ratio
                isVoiceActive.value = ratio > 0.5f
            }
        }
    }

    override fun onCleared() {
        vad?.release()
        vad = null
        super.onCleared()
    }
}
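
Per-buffer ratios can flicker when they hover near the threshold, causing the indicator to flash on and off. A common refinement, not part of CalibraVAD itself, is hysteresis: require a higher ratio to turn the indicator on than to turn it off. The 0.6/0.4 thresholds below are arbitrary examples:

```kotlin
// Hysteresis gate over the per-buffer VAD ratio. Once active, the gate
// stays active until the ratio drops below the lower (off) threshold.
class VoiceGate(
    private val onThreshold: Float = 0.6f,
    private val offThreshold: Float = 0.4f
) {
    var isActive = false
        private set

    fun update(ratio: Float): Boolean {
        isActive = if (isActive) ratio > offThreshold else ratio > onThreshold
        return isActive
    }
}

// Usage inside the collect block above:
// isVoiceActive.value = voiceGate.update(ratio)
```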

Next Steps