Speech

The Audio API provides a speech endpoint that converts text into lifelike spoken audio. It offers six built-in voices and three audio output formats (MP3, WAV, and PCM).

Quick start

The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use. A simple request looks like this:

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    base_url="https://api.scx.ai/v1",
    api_key="your-scx-api-key",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="ito",
    input="Hello! Welcome to scx text-to-speech.",
)

# Write the returned audio bytes to disk
Path("output.mp3").write_bytes(response.content)

By default, the endpoint returns the spoken audio as an MP3 file. Set the response_format parameter to request one of the other supported formats.
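As an illustration of requesting a non-default format, here is a minimal sketch. The helper name save_speech and its stem argument are hypothetical and not part of the API; the helper expects a client configured as in the quick start and names the output file after the requested format.

```python
from pathlib import Path


def save_speech(client, text, stem="output", response_format="wav",
                model="tts-1", voice="ito"):
    """Request speech in `response_format` and save it as <stem>.<format>.

    `client` is assumed to be an OpenAI-compatible client set up as in the
    quick start. This helper is illustrative only, not part of the API.
    """
    response = client.audio.speech.create(
        model=model,
        voice=voice,
        input=text,
        response_format=response_format,  # "mp3", "wav", or "pcm"
    )
    out_path = Path(f"{stem}.{response_format}")
    out_path.write_bytes(response.content)
    return out_path
```

For example, save_speech(client, "Hello!", response_format="wav") would write output.wav to the current directory.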

Audio quality

tts-1 is optimized for speed and provides the lowest latency, which makes it the better fit for real-time applications. tts-1-hd is optimized for quality at the cost of higher latency. Depending on the listening device and the listener, the difference in quality may not be noticeable.

Voice options

Experiment with different voices to find one that matches your desired tone and audience. The available voices are:

| Voice | Description |
| --- | --- |
| australian-sam | Male, Australian accent - natural and friendly |
| friendly-kiwi | Male, New Zealand accent - casual and approachable |
| likeable-aussie | Female, Australian/NZ accent - likeable and pleasing |
| ito | Male, American accent - warm and conversational |
| serene-assistant | Female, American accent - calm and professional |
| alice-bennett | Female, British accent - professional and articulate |

Supported output formats

The default response format is MP3, but other formats are available:

| Format | Description |
| --- | --- |
| mp3 | Default format. Widely supported, good compression. |
| wav | Uncompressed audio. Higher quality, larger file size. |
| pcm | Raw audio samples. Useful for further processing. |
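Raw PCM has no header, so most players cannot open it directly. One common processing step is to wrap the samples in a WAV container, sketched below with Python's standard wave module. The sample layout used here (24 kHz, 16-bit, mono) is an assumption, since this page does not state the PCM parameters; confirm them before relying on the defaults. The function name pcm_to_wav is hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the WAV bytes.

    ASSUMPTION: the defaults (24 kHz, mono, 2 bytes per sample) are
    placeholders; the actual PCM layout is not documented on this page.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

If the assumed parameters are wrong, the resulting file will still open but will play at the wrong speed or as noise, which is a quick way to spot a mismatch.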

Request parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| model | String | The model to use: tts-1 or tts-1-hd. | Required |
| input | String | The text to generate audio for. Maximum length is 5000 characters. | Required |
| voice | String | The voice to use for synthesis. See voice options above. | Required |
| response_format | String | The audio format: mp3, wav, or pcm. | mp3 |
| speed | Number | The speed of the generated audio. Range: 0.25 to 4.0. | 1.0 |
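The constraints in this table can be checked on the client before a request is sent. The following is a minimal sketch; validate_speech_request is a hypothetical helper, not part of the API, and treating empty input as invalid is an assumption drawn from the parameter being required.

```python
ALLOWED_MODELS = {"tts-1", "tts-1-hd"}
ALLOWED_VOICES = {
    "australian-sam", "friendly-kiwi", "likeable-aussie",
    "ito", "serene-assistant", "alice-bennett",
}
ALLOWED_FORMATS = {"mp3", "wav", "pcm"}
MAX_INPUT_CHARS = 5000
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def validate_speech_request(model, text, voice, response_format="mp3", speed=1.0):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if model not in ALLOWED_MODELS:
        problems.append(f"unknown model: {model!r}")
    if not text:
        # Assumption: a required input must also be non-empty.
        problems.append("input text is empty")
    elif len(text) > MAX_INPUT_CHARS:
        problems.append(
            f"input is {len(text)} characters; maximum is {MAX_INPUT_CHARS}"
        )
    if voice not in ALLOWED_VOICES:
        problems.append(f"unknown voice: {voice!r}")
    if response_format not in ALLOWED_FORMATS:
        problems.append(f"unsupported response_format: {response_format!r}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        problems.append(f"speed {speed} is outside {MIN_SPEED} to {MAX_SPEED}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once.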

Streaming real-time audio

The Speech API supports real-time audio streaming using chunked transfer encoding. This means audio can begin playing before the full file has been generated.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scx.ai/v1",
    api_key="your-scx-api-key",
)

# with_streaming_response yields the body as it arrives instead of
# buffering the whole file in memory before returning.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="serene-assistant",
    input="This is a streaming test. The audio will start playing before generation completes.",
) as response:
    # Stream to a file, writing each chunk as soon as it is received
    with open("stream_output.mp3", "wb") as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
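To see what streaming buys you, it can help to measure how long the first chunk takes to arrive compared with the whole response. The sketch below is a hypothetical helper (drain_stream is not part of the SDK) that accepts any iterable of byte chunks, such as the one returned by response.iter_bytes(), and any writable binary sink.

```python
import io
import time


def drain_stream(chunks, sink):
    """Write chunks to `sink` as they arrive.

    Returns (total_bytes, seconds_to_first_chunk); the second value is
    None if the stream produced no chunks.
    """
    start = time.monotonic()
    first_chunk_after = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_after is None:
            first_chunk_after = time.monotonic() - start
        sink.write(chunk)
        total_bytes += len(chunk)
    return total_bytes, first_chunk_after


# Local demonstration: in-memory chunks stand in for response.iter_bytes().
demo_sink = io.BytesIO()
demo_total, demo_latency = drain_stream([b"ab", b"cde"], demo_sink)
```

With a real response, the sink could instead be an open file or the input of an audio player, so playback starts as soon as the first chunk is written.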

Adjusting speed

You can adjust the speed of the generated audio by setting the speed parameter. Values range from 0.25 (slowest) to 4.0 (fastest); the default, 1.0, is normal speed.

# These calls reuse the `client` created in the quick start.

# Slower speech (0.75x speed)
response = client.audio.speech.create(
    model="tts-1",
    voice="alice-bennett",
    input="This will be spoken more slowly.",
    speed=0.75,
)

# Faster speech (1.5x speed)
response = client.audio.speech.create(
    model="tts-1",
    voice="alice-bennett",
    input="This will be spoken more quickly.",
    speed=1.5,
)

Limitations

  • Maximum input text length is 5000 characters per request.
  • For longer content, split your text into chunks and generate audio for each segment.
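
One way to split longer text is sketched below. It is a minimal illustration, not an official utility: split_text is a hypothetical name, the sentence detection is a simple regular expression on ., !, and ?, and a single sentence longer than the limit is cut at the character limit as a fallback.

```python
import re

MAX_INPUT_CHARS = 5000


def split_text(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Fallback: hard-split any single sentence longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the input of its own request, with the resulting audio segments saved or played in order.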