Overview
NVIDIA Nemotron Speech provides two STT service implementations:

- NvidiaSTTService — Real-time streaming transcription using Nemotron ASR Streaming models, with interim results and continuous audio processing.
- NvidiaSegmentedSTTService — Segmented transcription using Canary models, with advanced language support, word boosting, and enterprise-grade accuracy.
- NVIDIA Nemotron Speech STT API Reference: Pipecat’s API methods for NVIDIA Nemotron Speech STT integration
- Example Implementation: Complete example with NVIDIA services integration
- NVIDIA ASR NIM Documentation: Official NVIDIA ASR NIM documentation
- NVIDIA Developer Portal: Access API keys and Nemotron Speech services
Installation
To use NVIDIA Nemotron Speech services, install the required dependency:
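A typical install command is shown below. The extras name is an assumption based on Pipecat's optional-dependency convention; verify the exact command against the Pipecat API reference linked above.

```shell
# Assumed extras name for the NVIDIA services; check the Pipecat docs.
pip install "pipecat-ai[nvidia]"
```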
NVIDIA Nemotron Speech Setup
Before using NVIDIA Nemotron Speech STT services, you need:

- NVIDIA Developer Account (for cloud deployments): sign up at the NVIDIA Developer Portal
- API Key (for cloud deployments): generate an NVIDIA API key for Nemotron Speech services
- Model Selection: choose between Nemotron ASR Streaming (streaming) and Canary (segmented) models
Environment Variables
- `NVIDIA_API_KEY`: your NVIDIA API key for authentication (required for the cloud endpoint; not needed for local deployments)
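For example, in a POSIX shell (the key value is a placeholder):

```shell
# Export the key so the service can authenticate against the cloud endpoint.
export NVIDIA_API_KEY="your-api-key"
```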
NvidiaSTTService
Real-time streaming transcription using NVIDIA Nemotron Speech’s streaming ASR models.

Constructor parameters:

- NVIDIA API key for authentication. Required when using the cloud endpoint; not needed for local deployments.
- NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSTTService.Settings(...)` instead.
- Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint; set to `False` for local deployments.
- Number of audio channels.
- VAD start history in frames. Use `-1` for the Nemotron Speech default.
- VAD start threshold. Use `-1.0` for the Nemotron Speech default.
- VAD stop history in frames. Use `-1` for the Nemotron Speech default.
- VAD stop threshold. Use `-1.0` for the Nemotron Speech default.
- End-of-utterance stop history in frames. Use `-1` for the Nemotron Speech default.
- End-of-utterance stop threshold. Use `-1.0` for the Nemotron Speech default.
- Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
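The custom configuration string above is a comma-separated list of `key:value` pairs. A small helper (hypothetical, not part of Pipecat) can build it from a dict, using the option names from the page's example:

```python
def build_custom_configuration(options: dict) -> str:
    """Render options as the comma-separated "key:value" string format."""
    def render(value) -> str:
        # Booleans are rendered lowercase to match "enable_vad_endpointing:true".
        return str(value).lower() if isinstance(value, bool) else str(value)
    return ",".join(f"{key}:{render(value)}" for key, value in options.items())

cfg = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
# cfg == "enable_vad_endpointing:true,neural_vad.onset:0.65"
```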
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | True | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| interim_results | bool | True | Whether to return interim (partial) results. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |
| speaker_diarization | bool | False | Whether to enable speaker diarization. |
| diarization_max_speakers | int | 0 | Maximum number of speakers for diarization. |
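A sketch of the mid-conversation update mentioned above. The `STTUpdateSettingsFrame` import path is an assumption and is commented out so the snippet stands alone:

```python
# Hypothetical import path -- verify against the Pipecat API reference:
# from pipecat.frames.frames import STTUpdateSettingsFrame

# Settings updates are expressed as a mapping of setting name to new value,
# using the parameter names from the table above.
new_settings = {
    "profanity_filter": True,          # start filtering profanity
    "boosted_lm_words": ["Nemotron"],  # boost a domain-specific term
}

# Pushed into the running pipeline (sketch; `task` would be a PipelineTask):
# await task.queue_frame(STTUpdateSettingsFrame(settings=new_settings))
```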
Usage
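A minimal construction sketch, assuming pipecat-ai with the NVIDIA extras installed. The import path and the `api_key` parameter name follow Pipecat's usual conventions and are assumptions where this page does not name them; the service construction is commented out so the snippet stands alone:

```python
import os

# Hypothetical import path -- verify against the API reference linked above:
# from pipecat.services.nvidia.stt import NvidiaSTTService

# Pairs the NVIDIA cloud function ID with the model name (placeholder values).
model_function_map = {
    "function_id": "<your-function-id>",
    "model_name": "<nemotron-asr-streaming-model>",
}

# stt = NvidiaSTTService(
#     api_key=os.getenv("NVIDIA_API_KEY"),   # cloud endpoint auth
#     model_function_map=model_function_map,
#     settings=NvidiaSTTService.Settings(
#         language="en-US",        # or Language.EN_US
#         interim_results=True,    # stream partial transcripts
#     ),
# )
```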
Notes
- Model cannot be changed after initialization: use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Streaming: provides real-time interim and final results through continuous audio streaming.
- Metrics support: this service supports metrics generation (`can_generate_metrics()` returns `True`).
NvidiaSegmentedSTTService
Batch/segmented transcription using NVIDIA Nemotron Speech’s Canary models. Processes complete audio segments after VAD detects speech boundaries.

Constructor parameters:

- NVIDIA API key for authentication. Required when using the cloud endpoint; not needed for local deployments.
- NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSegmentedSTTService.Settings(...)` instead.
- Runtime-configurable settings. See Settings below.
- Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint; set to `False` for local deployments.
- Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSegmentedSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | False | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |
Usage
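A local-deployment sketch, assuming pipecat-ai with the NVIDIA extras installed. The import path and the `server`/`use_ssl` parameter names are assumptions inferred from the constructor descriptions above; the service construction is commented out so the snippet stands alone:

```python
# Hypothetical import path -- verify against the API reference linked above:
# from pipecat.services.nvidia.stt import NvidiaSegmentedSTTService

# Placeholder mapping for a Canary model (values are illustrative).
model_function_map = {
    "function_id": "<your-function-id>",
    "model_name": "<canary-model>",
}

# Local deployment: point at the local gRPC address and disable SSL.
# stt = NvidiaSegmentedSTTService(
#     server="localhost:50051",   # parameter name assumed from the description
#     use_ssl=False,              # parameter name assumed from the description
#     model_function_map=model_function_map,
#     settings=NvidiaSegmentedSTTService.Settings(language="en-US"),
# )
```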
Notes
- Model cannot be changed after initialization: use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Segmented processing: processes complete audio segments for higher accuracy compared to streaming.
- Language support: supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US). See the NVIDIA ASR NIM documentation for the complete list.
- Word boosting: use `boosted_lm_words` and `boosted_lm_score` to improve recognition of domain-specific terms.
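The word-boosting note above as a settings sketch, using the field names from the Settings table; the `Settings` construction is commented out so the snippet stands alone:

```python
# Boost recognition of domain-specific terms in the language model.
boost = {
    "boosted_lm_words": ["Nemotron", "Pipecat", "gRPC"],
    "boosted_lm_score": 8.0,  # above the 4.0 default for a stronger boost
}
# settings = NvidiaSegmentedSTTService.Settings(**boost)
```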