The Basics

The latest generation of AI models can transform written text into high-quality speech and offer many ways to make that speech expressive. However, making synthetic speech sound convincing for a given use case takes specialist knowledge (for example, of machine learning) and careful work. It often requires annotation, parameter tuning, editing the content to work around limitations of the voice, and additional processing of the generated speech.

API.audio offers not only a wide variety of voice models from various providers but also guardrails that make the production of synthetic speech robust and scalable.

API.audio also makes it easy to build a frontend through which your users interact with synthetic speech. It assures quality on well-known TTS challenges, such as the pronunciation of names, and lets you connect directly to an established content source to minimise the failure rate.

Create speech from a text script

When to create a script

We strongly recommend creating a script and rendering speech from it. Rather than calling the api.audio text-to-speech API directly, you first create a script with the script API. You then issue a production request that creates speech, or even a fully produced audio track, from that script.

Working from a script gives you:

  • Additional guardrails for robust, scalable speech production.
  • Additional annotation: switch speakers within the same track and use built-in parameters to personalise content or make it dynamic.
  • Logical decoupling of content from speech creation.
  • A link between a script and a SoundDesign template, which lets you automate production.
  • Input connectors that let you create content conveniently, directly from a source.
  • api.audio’s tooling and best practices for building a frontend that lets your users craft great-sounding audio from text.
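
The annotation benefit can be sketched in a few lines of Python. This is a minimal, offline illustration under stated assumptions. The `<<section::NAME>>` and `{{key}}` markers are a hypothetical notation, not necessarily api.audio's actual script syntax. `render_segments`, the voice names and the parameter values are also made up for the example. The sketch splits one script into sections, assigns each section its own voice and fills in personalisation parameters.

```python
import re
from dataclasses import dataclass

# Hypothetical annotation (an assumption for illustration only):
# "<<section::NAME>>" starts a new section, "{{key}}" marks a parameter.
SECTION_RE = re.compile(r"<<section::(\w+)>>")
PARAM_RE = re.compile(r"\{\{(\w+)\}\}")


@dataclass
class Segment:
    section: str
    voice: str
    text: str


def render_segments(
    script: str, voices: dict[str, str], params: dict[str, str]
) -> list[Segment]:
    """Split an annotated script into sections, give each a voice, fill parameters."""
    # With a capturing group, re.split returns [before, name1, body1, name2, body2, ...].
    # Any text before the first section marker is ignored in this sketch.
    parts = SECTION_RE.split(script)
    segments = []
    for name, body in zip(parts[1::2], parts[2::2]):
        # A missing parameter or voice raises KeyError rather than rendering silently.
        text = PARAM_RE.sub(lambda m: params[m.group(1)], body).strip()
        segments.append(Segment(section=name, voice=voices[name], text=text))
    return segments


script = (
    "<<section::intro>> Hello {{username}}, welcome back. "
    "<<section::news>> Here is today's update for {{city}}."
)
segments = render_segments(
    script,
    voices={"intro": "voice_a", "news": "voice_b"},
    params={"username": "Sam", "city": "Berlin"},
)
for seg in segments:
    print(seg.section, seg.voice, seg.text)
```

The text, the voice mapping and the parameters are passed separately. As a result, the same script can be re-rendered for another listener or with other voices without editing the content.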

You can create a script with the "create script" API call.
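
The two-step workflow (create a script, then issue production requests against it) can be sketched as follows. Everything here is hypothetical. `ScriptClient`, `create_script`, `create_production`, the `sound_template` argument and the in-memory stand-in are illustrative names, not api.audio's actual API or SDK, and the sketch makes no network calls. A real client would take the place of `InMemoryClient` while keeping the same flow.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional, Protocol


class ScriptClient(Protocol):
    """Hypothetical client interface; the method names are illustrative only."""

    def create_script(self, text: str) -> str: ...

    def create_production(
        self, script_id: str, voice: str, sound_template: Optional[str] = None
    ) -> dict: ...


@dataclass
class InMemoryClient:
    """Offline stand-in that mimics the two hypothetical calls locally."""

    scripts: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=itertools.count)

    def create_script(self, text: str) -> str:
        # Store the content once and hand back an identifier for it.
        script_id = f"script-{next(self._ids)}"
        self.scripts[script_id] = text
        return script_id

    def create_production(
        self, script_id: str, voice: str, sound_template: Optional[str] = None
    ) -> dict:
        # A production only references the script; it never edits the text.
        if script_id not in self.scripts:
            raise KeyError(f"unknown script: {script_id}")
        return {"script_id": script_id, "voice": voice, "sound_template": sound_template}


def produce_variants(
    client: ScriptClient, text: str, voices: list[str], sound_template: Optional[str] = None
) -> list[dict]:
    # Step 1: create the script (the content) once.
    script_id = client.create_script(text)
    # Step 2: issue one production request per voice against that same script.
    return [client.create_production(script_id, voice, sound_template) for voice in voices]


client = InMemoryClient()
results = produce_variants(
    client, "Welcome to today's episode.", ["voice_a", "voice_b"], sound_template="example_template"
)
print(results)
```

The content is stored once and every production only points at it. You can therefore re-render the same script with a different voice or SoundDesign template without touching the text.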
