Overview
NVIDIA Nemotron Speech provides two STT service implementations:

- NvidiaSTTService — Real-time streaming transcription using Nemotron ASR Streaming models, with interim results and continuous audio processing.
- NvidiaSegmentedSTTService — Segmented transcription using Canary models, with advanced language support, word boosting, and enterprise-grade accuracy.
- NVIDIA Nemotron Speech STT API Reference: Pipecat’s API methods for NVIDIA Nemotron Speech STT integration
- Example Implementation: Complete example with NVIDIA services integration
- NVIDIA ASR NIM Documentation: Official NVIDIA ASR NIM documentation
- NVIDIA Developer Portal: Access API keys and Nemotron Speech services
Installation
To use NVIDIA Nemotron Speech services, install the required dependency:
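A typical install command is shown below. The extras name is an assumption based on Pipecat's optional-dependency convention; verify the exact command against the Pipecat API reference linked above.

```shell
# Assumed extras name for the NVIDIA services; check the Pipecat docs.
pip install "pipecat-ai[nvidia]"
```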
NVIDIA Nemotron Speech Setup
Before using NVIDIA Nemotron Speech STT services, you need:

- NVIDIA Developer Account (for cloud deployments): sign up at the NVIDIA Developer Portal
- API Key (for cloud deployments): generate an NVIDIA API key for Nemotron Speech services
- Model Selection: choose between Nemotron ASR Streaming (streaming) and Canary (segmented) models
Environment Variables
- `NVIDIA_API_KEY`: your NVIDIA API key for authentication (required for the cloud endpoint; not needed for local deployments)
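For example, in a POSIX shell (the key value is a placeholder):

```shell
# Export the key so the service can authenticate against the cloud endpoint.
export NVIDIA_API_KEY="your-api-key"
```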
NvidiaSTTService
Real-time streaming transcription using NVIDIA Nemotron Speech’s streaming ASR models.

Constructor parameters:

- NVIDIA API key for authentication. Required when using the cloud endpoint; not needed for local deployments.
- NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSTTService.Settings(...)` instead.
- Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint; set to `False` for local deployments.
- Number of audio channels.
- VAD start history in frames. Use `-1` for the Nemotron Speech default.
- VAD start threshold. Use `-1.0` for the Nemotron Speech default.
- VAD stop history in frames. Use `-1` for the Nemotron Speech default.
- VAD stop threshold. Use `-1.0` for the Nemotron Speech default.
- End-of-utterance stop history in frames. Use `-1` for the Nemotron Speech default.
- End-of-utterance stop threshold. Use `-1.0` for the Nemotron Speech default.
- Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
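The custom configuration string above is a comma-separated list of `key:value` pairs. A small helper (hypothetical, not part of Pipecat) can build it from a dict, using the option names from the page's example:

```python
def build_custom_configuration(options: dict) -> str:
    """Render options as the comma-separated "key:value" string format."""
    def render(value) -> str:
        # Booleans are rendered lowercase to match "enable_vad_endpointing:true".
        return str(value).lower() if isinstance(value, bool) else str(value)
    return ",".join(f"{key}:{render(value)}" for key, value in options.items())

cfg = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
# cfg == "enable_vad_endpointing:true,neural_vad.onset:0.65"
```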
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | True | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| interim_results | bool | True | Whether to return interim (partial) results. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |
| speaker_diarization | bool | False | Whether to enable speaker diarization. |
| diarization_max_speakers | int | 0 | Maximum number of speakers for diarization. |
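A sketch of the mid-conversation update mentioned above. The `STTUpdateSettingsFrame` import path is an assumption and is commented out so the snippet stands alone:

```python
# Hypothetical import path -- verify against the Pipecat API reference:
# from pipecat.frames.frames import STTUpdateSettingsFrame

# Settings updates are expressed as a mapping of setting name to new value,
# using the parameter names from the table above.
new_settings = {
    "profanity_filter": True,          # start filtering profanity
    "boosted_lm_words": ["Nemotron"],  # boost a domain-specific term
}

# Pushed into the running pipeline (sketch; `task` would be a PipelineTask):
# await task.queue_frame(STTUpdateSettingsFrame(settings=new_settings))
```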
Usage
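A minimal construction sketch, assuming pipecat-ai with the NVIDIA extras installed. The import path and the `api_key` parameter name follow Pipecat's usual conventions and are assumptions where this page does not name them; the service construction is commented out so the snippet stands alone:

```python
import os

# Hypothetical import path -- verify against the API reference linked above:
# from pipecat.services.nvidia.stt import NvidiaSTTService

# Pairs the NVIDIA cloud function ID with the model name (placeholder values).
model_function_map = {
    "function_id": "<your-function-id>",
    "model_name": "<nemotron-asr-streaming-model>",
}

# stt = NvidiaSTTService(
#     api_key=os.getenv("NVIDIA_API_KEY"),   # cloud endpoint auth
#     model_function_map=model_function_map,
#     settings=NvidiaSTTService.Settings(
#         language="en-US",        # or Language.EN_US
#         interim_results=True,    # stream partial transcripts
#     ),
# )
```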
Notes
- Model cannot be changed after initialization: use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Streaming: provides real-time interim and final results through continuous audio streaming.
- Metrics support: this service supports metrics generation (`can_generate_metrics()` returns `True`).
NvidiaSegmentedSTTService
Batch/segmented transcription using NVIDIA Nemotron Speech’s Canary models. Processes complete audio segments after VAD detects speech boundaries.

Constructor parameters:

- NVIDIA API key for authentication. Required when using the cloud endpoint; not needed for local deployments.
- NVIDIA Nemotron Speech server address. For local deployments, pass the local address (e.g. `localhost:50051`).
- Mapping containing `function_id` and `model_name` for the ASR model.
- Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- Additional configuration parameters. Deprecated in v0.0.105; use `settings=NvidiaSegmentedSTTService.Settings(...)` instead.
- Runtime-configurable settings. See Settings below.
- Whether to use SSL for the gRPC connection. Defaults to `True` for the NVIDIA cloud endpoint; set to `False` for local deployments.
- Custom Nemotron Speech configuration string (e.g. `"enable_vad_endpointing:true,neural_vad.onset:0.65"`).
- P99 latency from speech end to final transcript, in seconds. Override for your deployment. See stt-benchmark.
Settings
Runtime-configurable settings passed via the `settings` constructor argument using `NvidiaSegmentedSTTService.Settings(...)`. These can be updated mid-conversation with `STTUpdateSettingsFrame`. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN_US | Target language for transcription. (Inherited from base STT settings.) |
| profanity_filter | bool | False | Whether to filter profanity from results. |
| automatic_punctuation | bool | True | Whether to add automatic punctuation. |
| verbatim_transcripts | bool | False | Whether to return verbatim transcripts. |
| boosted_lm_words | list[str] | None | List of words to boost in the language model. |
| boosted_lm_score | float | 4.0 | Score boost for specified words. |
| max_alternatives | int | 1 | Maximum number of recognition alternatives. |
| word_time_offsets | bool | False | Whether to include word-level time offsets. |
Usage
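A local-deployment sketch, assuming pipecat-ai with the NVIDIA extras installed. The import path and the `server`/`use_ssl` parameter names are assumptions inferred from the constructor descriptions above; the service construction is commented out so the snippet stands alone:

```python
# Hypothetical import path -- verify against the API reference linked above:
# from pipecat.services.nvidia.stt import NvidiaSegmentedSTTService

# Placeholder mapping for a Canary model (values are illustrative).
model_function_map = {
    "function_id": "<your-function-id>",
    "model_name": "<canary-model>",
}

# Local deployment: point at the local gRPC address and disable SSL.
# stt = NvidiaSegmentedSTTService(
#     server="localhost:50051",   # parameter name assumed from the description
#     use_ssl=False,              # parameter name assumed from the description
#     model_function_map=model_function_map,
#     settings=NvidiaSegmentedSTTService.Settings(language="en-US"),
# )
```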
Notes
- Model cannot be changed after initialization: use the `model_function_map` parameter in the constructor to specify the model and function ID.
- Segmented processing: processes complete audio segments for higher accuracy compared to streaming.
- Language support: supports Arabic, English (US/GB), French, German, Hindi, Italian, Japanese, Korean, Portuguese (BR), Russian, and Spanish (ES/US). See the NVIDIA ASR NIM documentation for the complete list.
- Word boosting: use `boosted_lm_words` and `boosted_lm_score` to improve recognition of domain-specific terms.
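The word-boosting note above as a settings sketch, using the field names from the Settings table; the `Settings` construction is commented out so the snippet stands alone:

```python
# Boost recognition of domain-specific terms in the language model.
boost = {
    "boosted_lm_words": ["Nemotron", "Pipecat", "gRPC"],
    "boosted_lm_score": 8.0,  # above the 4.0 default for a stronger boost
}
# settings = NvidiaSegmentedSTTService.Settings(**boost)
```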