## Overview

SarvamSTTService provides real-time speech recognition using Sarvam AI's WebSocket API, supporting high-accuracy transcription of Indian languages with Voice Activity Detection (VAD) and multiple audio formats.
- **Sarvam STT API Reference**: Pipecat's API methods for Sarvam STT integration
- **Example Implementation**: Complete example with interruption handling
- **Sarvam Documentation**: Official Sarvam AI STT documentation and features
- **Sarvam AI Platform**: Access API keys and speech models
## Installation

To use Sarvam services, install the required dependency:
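The install command itself is not shown on this page; a plausible sketch, assuming the dependency is Pipecat's optional `sarvam` extra (the extra name is an assumption):

```shell
pip install "pipecat-ai[sarvam]"
```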
## Prerequisites

### Sarvam AI Account Setup

Before using Sarvam STT services, you need:

- **Sarvam AI Account**: Sign up at Sarvam AI
- **API Key**: Generate an API key from your account dashboard
- **Model Access**: Access to Saarika (STT) or Saaras (STT-Translate) models, including the `saaras:v3` model with support for multiple modes (transcribe, translate, verbatim, translit, codemix)
### Required Environment Variables

- `SARVAM_API_KEY`: Your Sarvam AI API key for authentication
## Configuration

### SarvamSTTService

Constructor parameters:

- Sarvam API key for authentication.
- Sarvam model to use. Allowed values: `"saarika:v2.5"` (standard STT), `"saaras:v2.5"` (STT-Translate, auto-detects language), and `"saaras:v3"` (advanced; supports mode and fine-grained VAD). Deprecated in v0.0.105. Use `settings=SarvamSTTService.Settings(...)` instead.
- Audio sample rate in Hz. Defaults to 16000 if not specified.
- Mode of operation. Only applicable to models that support it (e.g., `saaras:v3`). Defaults to the model's default mode.
- Audio codec/format of the input file.
- Configuration parameters for the Sarvam STT service. Deprecated in v0.0.105. Use `settings=SarvamSTTService.Settings(...)` instead.
- Runtime-configurable settings for the STT service. See Settings below.
- Seconds of no audio before sending silence to keep the connection alive. `None` disables keepalive.
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
- Seconds between idle checks when keepalive is enabled.
## Settings

Runtime-configurable settings passed via the `settings` constructor argument using `SarvamSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `None` | STT model identifier. (Inherited from base STT settings.) |
| `language` | `Language` \| `str` | `None` | Target language for transcription. (Inherited from base STT settings.) Behavior varies by model: `saarika:v2.5` defaults to `"unknown"` (auto-detect), `saaras:v2.5` ignores this (auto-detects), `saaras:v3` defaults to `"en-IN"`. |
| `prompt` | `str` | `None` | Optional prompt to guide transcription/translation style. Only applicable to `saaras:v2.5`. |
| `vad_signals` | `bool` | `None` | Enable VAD signals in responses. When enabled, the service broadcasts `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` from the server. |
| `high_vad_sensitivity` | `bool` | `None` | Enable high VAD sensitivity for more responsive speech detection. |
| `positive_speech_threshold` | `float` | `None` | VAD probability threshold (0.0-1.0) above which a frame is considered speech. Only for `saaras:v3`. |
| `negative_speech_threshold` | `float` | `None` | VAD probability threshold (0.0-1.0) below which a frame is considered silence. Only for `saaras:v3`. |
| `min_speech_frames` | `int` | `None` | Minimum consecutive speech frames to start a speech segment. Only for `saaras:v3`. |
| `first_turn_min_speech_frames` | `int` | `None` | Minimum speech frames for the first user turn. Only for `saaras:v3`. |
| `negative_frames_count` | `int` | `None` | Number of silence frames within the window to end a speech segment. Only for `saaras:v3`. |
| `negative_frames_window` | `int` | `None` | Sliding window size (in frames) for counting negative frames. Only for `saaras:v3`. |
| `start_speech_volume_threshold` | `float` | `None` | Volume level (dB) below which audio is too quiet to be speech. Only for `saaras:v3`. |
| `interrupt_min_speech_frames` | `int` | `None` | Minimum speech frames to register a barge-in/interruption. Only for `saaras:v3`. |
| `pre_speech_pad_frames` | `int` | `None` | Number of audio frames to prepend before detected speech onset. Only for `saaras:v3`. |
| `num_initial_ignored_frames` | `int` | `None` | Number of leading audio frames to skip at connection start. Only for `saaras:v3`. |
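Settings can be changed mid-conversation by queueing an `STTUpdateSettingsFrame`. A sketch, assuming the frame lives in `pipecat.frames.frames` and accepts a `settings` mapping (both unverified by this page):

```python
from pipecat.frames.frames import STTUpdateSettingsFrame  # import path assumed

# Switch transcription language at runtime, e.g. from a pipeline task:
await task.queue_frames([
    STTUpdateSettingsFrame(settings={"language": "ta-IN"}),
])
```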
## Usage
### Basic Setup
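A minimal sketch, assuming the service is importable from `pipecat.services.sarvam.stt` (the import path is not confirmed by this page):

```python
import os

from pipecat.services.sarvam.stt import SarvamSTTService  # import path assumed

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
)

# Place the service in a Pipecat pipeline between transport input and
# downstream processors, e.g.:
# pipeline = Pipeline([transport.input(), stt, ...])
```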
### With Language and Model Configuration
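A sketch using the `settings=SarvamSTTService.Settings(...)` pattern described above; the import path is an assumption. `saarika:v2.5` accepts an explicit target language (it defaults to `"unknown"`, i.e. auto-detect):

```python
import os

from pipecat.services.sarvam.stt import SarvamSTTService  # import path assumed

stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    settings=SarvamSTTService.Settings(
        model="saarika:v2.5",
        language="hi-IN",  # Hindi; a Language enum value also works per the table above
    ),
)
```

Note that `language` is not valid with `saaras:v2.5`, which always auto-detects the language.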
### With Server-Side VAD
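A sketch enabling Sarvam's server-side VAD with `saaras:v3`; the import path is assumed and the threshold values below are illustrative, not recommendations:

```python
import os

from pipecat.services.sarvam.stt import SarvamSTTService  # import path assumed

# With vad_signals=True the service broadcasts UserStartedSpeakingFrame /
# UserStoppedSpeakingFrame based on the server's VAD decisions.
stt = SarvamSTTService(
    api_key=os.getenv("SARVAM_API_KEY"),
    settings=SarvamSTTService.Settings(
        model="saaras:v3",
        vad_signals=True,
        high_vad_sensitivity=True,
        # Fine-grained VAD tuning (saaras:v3 only); example values:
        positive_speech_threshold=0.5,
        negative_speech_threshold=0.35,
        min_speech_frames=3,
    ),
)
```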
## Notes

- **Default model changed**: As of this update, the default model is `saaras:v3` (previously `saarika:v2.5`). Applications that relied on the previous default should set `settings=SarvamSTTService.Settings(model="saarika:v2.5")` explicitly.
- **Supported languages**: Bengali (`bn-IN`), Gujarati (`gu-IN`), Hindi (`hi-IN`), Kannada (`kn-IN`), Malayalam (`ml-IN`), Marathi (`mr-IN`), Tamil (`ta-IN`), Telugu (`te-IN`), Punjabi (`pa-IN`), Odia (`od-IN`), English (`en-IN`), and Assamese (`as-IN`).
- **Model-specific parameter validation**: The service validates that parameters are compatible with the selected model. For example, `prompt` is only supported with `saaras:v2.5`, `language` is not supported with `saaras:v2.5` (which auto-detects language), and the fine-grained VAD parameters are only supported with `saaras:v3`.
- **Fine-grained VAD tuning (`saaras:v3` only)**: The `saaras:v3` model supports server-side VAD with 10 tuning parameters for speech detection thresholds, frame-count controls, pre-speech padding, interruption sensitivity, and initial-frame skipping.
- **VAD modes**: When `vad_signals=False` (the default), the service relies on Pipecat's local VAD and flushes the server buffer on `VADUserStoppedSpeakingFrame`. When `vad_signals=True`, the service uses Sarvam's server-side VAD and broadcasts speaking frames from the server.
## Event Handlers

In addition to the standard service connection events (`on_connected`, `on_disconnected`, `on_connection_error`), Sarvam STT provides:

| Event | Description |
|---|---|
| `on_speech_started` | Speech detected in the audio stream |
| `on_speech_stopped` | Speech no longer detected in the audio stream |
| `on_utterance_end` | End of utterance detected |
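A registration sketch, assuming Pipecat's usual `event_handler` decorator pattern (handler signatures are an assumption; check the API reference for the exact arguments each event passes):

```python
@stt.event_handler("on_speech_started")
async def on_speech_started(service):
    # Fired when the server-side VAD detects speech in the stream.
    print("User started speaking")

@stt.event_handler("on_utterance_end")
async def on_utterance_end(service):
    # Fired when the server marks the end of an utterance.
    print("Utterance complete")
```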