Speech
The Audio API provides a speech endpoint for converting text into lifelike spoken audio. It comes with 6 built-in voices and supports multiple audio output formats.
Quick start
The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use.
By default, the endpoint outputs an MP3 file of the spoken audio, but it can be configured to output other supported formats.
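As a minimal sketch of such a request, the snippet below builds (but does not send) an HTTP POST carrying the three key inputs. The parameter names `model`, `input`, and `voice` come from the parameter table in this document; the endpoint URL, the JSON body encoding, and the bearer-token header are assumptions for illustration and may differ for your provider.

```python
import json
import urllib.request

# Hypothetical endpoint URL and API key; substitute your provider's values.
SPEECH_URL = "https://api.example.com/v1/audio/speech"
API_KEY = "YOUR_API_KEY"


def build_speech_request(model: str, text: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the three key inputs."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
            "Content-Type": "application/json",    # assumed body encoding
        },
        method="POST",
    )


request = build_speech_request("tts-1", "Hello, world!", "ito")

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     with open("speech.mp3", "wb") as f:
#         f.write(response.read())
```

With no `response_format` set, the bytes written in the commented-out step would be MP3 data, per the default described above.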
Audio quality
For real-time applications, the standard tts-1 model provides the lowest latency, but at lower quality than tts-1-hd: tts-1 is optimized for speed, while tts-1-hd is optimized for quality. Depending on the playback device and the listener, the difference may not be noticeable.
Voice options
Experiment with different voices to find one that matches your desired tone and audience. The available voices are:
| Voice | Description |
|---|---|
| australian-sam | Male, Australian accent - natural and friendly |
| friendly-kiwi | Male, New Zealand accent - casual and approachable |
| likeable-aussie | Female, Australian/NZ accent - likeable and pleasing |
| ito | Male, American accent - warm and conversational |
| serene-assistant | Female, American accent - calm and professional |
| alice-bennett | Female, British accent - professional and articulate |
Supported output formats
The default response format is MP3, but other formats are available:
| Format | Description |
|---|---|
| mp3 | Default format. Widely supported, good compression. |
| wav | Uncompressed audio. Higher quality, larger file size. |
| pcm | Raw audio samples. Useful for further processing. |
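Raw PCM has no header, so most players cannot open it directly. One common processing step is to wrap the samples in a WAV container, sketched below with Python's standard `wave` module. The sample layout used here (24 kHz, mono, 16-bit) is an assumption; this document does not specify the PCM layout, so confirm it before relying on these defaults.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions, not documented values.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at the assumed layout: 24000 frames of 2 bytes each.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)
```

The result can be written to a `.wav` file or passed to any library that reads WAV data.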
Request parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| model | String | The model to use: tts-1 or tts-1-hd. | Required |
| input | String | The text to generate audio for. Maximum length is 5000 characters. | Required |
| voice | String | The voice to use for synthesis. See voice options above. | Required |
| response_format | String | The audio format: mp3, wav, or pcm. | mp3 |
| speed | Number | The speed of the generated audio. Range: 0.25 to 4.0. | 1.0 |
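Checking these constraints on the client side can catch mistakes before a request is sent. The following is a minimal sketch of such a check; the helper name `build_payload` is hypothetical, and the allowed values are copied from the tables in this document. How the service itself reports invalid parameters is not covered here.

```python
MODELS = {"tts-1", "tts-1-hd"}
VOICES = {
    "australian-sam", "friendly-kiwi", "likeable-aussie",
    "ito", "serene-assistant", "alice-bennett",
}
FORMATS = {"mp3", "wav", "pcm"}
MAX_INPUT_CHARS = 5000


def build_payload(model: str, text: str, voice: str,
                  response_format: str = "mp3", speed: float = 1.0) -> dict:
    """Validate the documented parameters and return a request payload."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


payload = build_payload("tts-1-hd", "Good morning.", "alice-bennett", speed=1.25)
```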
Streaming real-time audio
The Speech API supports real-time audio streaming using chunked transfer encoding. This means audio can begin playing before the full file is generated.
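On the client side, streaming amounts to reading the response body in pieces instead of all at once. The sketch below shows that loop against any file-like object; the helper name `iter_audio_chunks` and the 4096-byte chunk size are illustrative choices, and an in-memory buffer stands in for a live response so the example runs offline. Python's `http.client.HTTPResponse` (what `urllib.request.urlopen` returns) decodes chunked transfer encoding transparently, so the same loop should apply to a real response.

```python
import io
from typing import BinaryIO, Iterator


def iter_audio_chunks(response: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio bytes as they are read instead of waiting for the whole body."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        yield chunk


# Stand-in for a live HTTP response so the sketch runs without a network.
fake_response = io.BytesIO(b"\x01" * 10000)

received = bytearray()
for chunk in iter_audio_chunks(fake_response):
    # A real player would hand each chunk to its audio buffer here.
    received.extend(chunk)
```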
Adjusting speed
You can adjust the speed of the generated audio by setting the speed parameter. Values range from 0.25 (slowest) to 4.0 (fastest); the default, 1.0, is normal speed.
Limitations
- Maximum input text length is 5000 characters per request.
- For longer content, split your text into chunks and generate audio for each segment.
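One way to split longer content is to break on sentence boundaries so that each segment ends at a natural pause. The sketch below is one possible approach, not a prescribed method: the helper name `split_text` is hypothetical, sentences are detected with a simple punctuation-based regular expression, and any single sentence longer than the limit is cut at the limit as a fallback.

```python
import re

MAX_INPUT_CHARS = 5000  # per-request limit stated above


def split_text(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Small limit used here only to make the behavior easy to see.
example = split_text("One. Two. Three.", limit=10)
```

Each returned chunk can then be sent as the `input` of its own request, and the resulting audio segments played or concatenated in order. Note that sentences are rejoined with a single space, so runs of whitespace between sentences are normalized.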