Speech

The Audio API provides a speech endpoint for converting text into lifelike spoken audio. It offers six built-in voices and three output formats (MP3, WAV, and PCM).

Quick start

The speech endpoint takes three key inputs: the model, the text to convert to audio, and the voice to use. A simple request looks like this:

from openai import OpenAI
from pathlib import Path

# Point the OpenAI SDK at the scx endpoint.
client = OpenAI(
    base_url="https://api.scx.ai/v1",
    api_key="your-scx-api-key",
)

# Generate speech for the input text with the chosen model and voice.
response = client.audio.speech.create(
    model="tts-1",
    voice="ito",
    input="Hello! Welcome to scx text-to-speech.",
)

# Save the returned audio bytes (MP3 by default).
Path("output.mp3").write_bytes(response.content)

By default, the endpoint outputs an MP3 file of the spoken audio, but it can be configured to output other supported formats.
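
For readers not using the SDK, the same request can be sketched with the standard library alone. This is a minimal sketch with two assumptions: the URL is the SDK's base_url followed by the `/audio/speech` path that the OpenAI SDK appends, and the key is sent as a Bearer token as the SDK does. The function names `build_http_request` and `save_speech` are illustrative, not part of any API.

```python
import json
import urllib.request

# Assumption: the SDK's base_url plus the "/audio/speech" path it appends.
SPEECH_URL = "https://api.scx.ai/v1/audio/speech"


def build_http_request(api_key: str, text: str, voice: str = "ito",
                       model: str = "tts-1") -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint."""
    body = json.dumps({"model": model, "voice": voice, "input": text}).encode("utf-8")
    return urllib.request.Request(
        SPEECH_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request) as response:
        with open(path, "wb") as f:
            f.write(response.read())
```

Calling `save_speech(build_http_request("your-scx-api-key", "Hello!"), "output.mp3")` would perform the request and write the MP3 to disk.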

Audio quality

The tts-1 model is optimized for speed and has the lowest latency, which suits real-time applications. The tts-1-hd model is optimized for quality at the cost of higher latency. Depending on the listening device and the listener, the difference in quality may not be noticeable.

Voice options

Experiment with different voices to find one that matches your desired tone and audience. The available voices are:

Voice	Description
`australian-sam`	Male, Australian accent - natural and friendly
`friendly-kiwi`	Male, New Zealand accent - casual and approachable
`likeable-aussie`	Female, Australian/NZ accent - likeable and pleasing
`ito`	Male, American accent - warm and conversational
`serene-assistant`	Female, American accent - calm and professional
`alice-bennett`	Female, British accent - professional and articulate

Supported output formats

The default response format is MP3, but other formats are available:

Format	Description
`mp3`	Default format. Widely supported, good compression.
`wav`	Uncompressed audio. Higher quality, larger file size.
`pcm`	Raw audio samples. Useful for further processing.
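
As one example of further processing, raw PCM can be wrapped in a WAV container so that ordinary players can open it. This document does not state the sample rate, bit depth, or channel count of the pcm output, so the constants below are placeholder assumptions (16-bit mono at 24 kHz) that should be checked against the real output. The helper name `pcm_to_wav` is illustrative.

```python
import io
import wave

# Assumptions (not stated in this document): 16-bit signed samples,
# mono, 24 kHz. Verify these before relying on the result.
ASSUMED_SAMPLE_RATE = 24_000
ASSUMED_SAMPLE_WIDTH = 2  # bytes per sample
ASSUMED_CHANNELS = 1


def pcm_to_wav(pcm: bytes,
               sample_rate: int = ASSUMED_SAMPLE_RATE,
               sample_width: int = ASSUMED_SAMPLE_WIDTH,
               channels: int = ASSUMED_CHANNELS) -> bytes:
    """Wrap raw PCM samples in a WAV header and return the WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

If the audio plays back too fast, too slow, or as noise, the assumed sample rate or sample width is the first thing to revisit.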

Request parameters

Parameter	Type	Description	Default
`model`	String	The model to use: `tts-1` or `tts-1-hd`.	Required
`input`	String	The text to generate audio for. Maximum length is 5000 characters.	Required
`voice`	String	The voice to use for synthesis. See voice options above.	Required
`response_format`	String	The audio format: `mp3`, `wav`, or `pcm`.	`mp3`
`speed`	Number	The speed of the generated audio. Range: 0.25 to 4.0.	`1.0`
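
The constraints in the table above can be checked locally before a request is sent. The following is a minimal sketch; the helper name `speech_params` is hypothetical, and the allowed values are copied from the tables in this document rather than queried from the service.

```python
# Allowed values as listed in this document.
MODELS = {"tts-1", "tts-1-hd"}
VOICES = {
    "australian-sam", "friendly-kiwi", "likeable-aussie",
    "ito", "serene-assistant", "alice-bennett",
}
FORMATS = {"mp3", "wav", "pcm"}
MAX_INPUT_CHARS = 5000


def speech_params(model: str, input_text: str, voice: str,
                  response_format: str = "mp3", speed: float = 1.0) -> dict:
    """Validate the arguments and return keyword arguments for the request."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if len(input_text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
```

The returned dictionary can be unpacked into the SDK call, for example `client.audio.speech.create(**speech_params("tts-1", "Hello!", "ito"))`.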

Streaming real-time audio

The Speech API supports real-time audio streaming using chunked transfer encoding. This means audio can begin playing before the full file is generated.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scx.ai/v1",
    api_key="your-scx-api-key",
)

# with_streaming_response yields bytes as they arrive; a plain create()
# call reads the whole response body before returning.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="serene-assistant",
    input="This is a streaming test. The audio will start playing before generation completes.",
) as response:
    # Stream to a file
    with open("stream_output.mp3", "wb") as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
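
To benefit from streaming, each chunk has to be handed to a consumer (a player, a socket, a file) as soon as it arrives. The sketch below shows that pattern and also records how long the first chunk took to arrive. The helper name `relay_chunks` is hypothetical, and `chunks` stands for any iterable of byte chunks, such as the one returned by `response.iter_bytes()`.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def relay_chunks(chunks: Iterable[bytes],
                 on_chunk: Callable[[bytes], None]) -> Tuple[int, Optional[float]]:
    """Pass each non-empty chunk to `on_chunk` as soon as it arrives.

    Returns the total number of bytes relayed and the delay in seconds
    before the first chunk arrived (None if there were no chunks).
    """
    start = time.monotonic()
    first_chunk_delay: Optional[float] = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if first_chunk_delay is None:
            first_chunk_delay = time.monotonic() - start
        on_chunk(chunk)
        total += len(chunk)
    return total, first_chunk_delay
```

For example, `relay_chunks(response.iter_bytes(), f.write)` would write to an open file `f` while reporting the time to first audio, which is the figure that matters most for real-time playback.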

Adjusting speed

Set the speed parameter to adjust the speed of the generated audio. Values range from 0.25 (slowest) to 4.0 (fastest); the default, 1.0, is normal speed.

# Reuses the client created in the quick start.

# Slower speech (0.75x speed)
response = client.audio.speech.create(
    model="tts-1",
    voice="alice-bennett",
    input="This will be spoken more slowly.",
    speed=0.75,
)

# Faster speech (1.5x speed)
response = client.audio.speech.create(
    model="tts-1",
    voice="alice-bennett",
    input="This will be spoken more quickly.",
    speed=1.5,
)

Limitations

Maximum input text length is 5000 characters per request.
For longer content, split your text into chunks and generate audio for each segment.
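
One way to split long text is to break it at sentence boundaries so that no chunk exceeds the limit. The following is a minimal sketch of that approach; the helper name `split_text` is hypothetical, the sentence detection is a simple regular expression, and a single sentence longer than the limit is cut at the limit as a fallback.

```python
import re
from typing import List

MAX_INPUT_CHARS = 5000  # limit stated in this document


def split_text(text: str, limit: int = MAX_INPUT_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break after sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: hard-split a sentence that is itself over the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request, with the resulting audio segments played or joined in order.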