Annotation

A script supports the following annotation types:

❗️

Use Two Curly Brackets for Personalisation Parameters

A common issue is that users enter {username} instead of {{username}}. Be sure to use TWO curly brackets for personalisation parameters.
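
The single-bracket mistake described above can be caught before a script is submitted. The following is a minimal sketch, not part of the api.audio SDK: the function name `find_single_brace_params` is hypothetical, and the pattern assumes parameter names consist of letters, digits, and underscores.

```python
import re

# Matches {name} or {name|default} that is NOT wrapped in a second pair of braces.
# Assumption: parameter names are made of word characters only.
SINGLE_BRACE = re.compile(r"(?<!\{)\{(\w+(?:\|[^{}]*)?)\}(?!\})")


def find_single_brace_params(script_text: str) -> list[str]:
    """Return the contents of every single-bracket placeholder in the text."""
    return SINGLE_BRACE.findall(script_text)


if __name__ == "__main__":
    text = "Hello {username}, welcome to {{location|London}}."
    # Only the single-bracket placeholder is reported.
    print(find_single_brace_params(text))
```

An empty result means every placeholder found uses the double-bracket form.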

Element: SSML
Description: Any SSML tag can be included in a script, but only tags that affect the content organisation will be interpreted.
Format: <break>
Example: <break time="3s"/>

Element: Personalisation parameters
Description: Personalisation parameters can be used to dynamically replace content components within a script.
Format: {{username}}, {{location}}
Example: {{username}} or {{location|London}}
         Hello {{username|clara}}, welcome to your script.

Element: Comments & Sections
Description: A comment can be used for informational purposes. It is included in the script but ignored by processing.
Format: <<sectionName::greeting>>
Example: <<sectionName::main_part>>

Element: Sound Design Annotation
Description: Sound Design Annotation allows you to connect elements of your script to elements of a Sound Design Template.
Format: <<soundSegment::name>>, <<soundEffect::name>>
Example: <<soundSegment::music>>, <<soundEffect::jingle>>

Element: MediaFile
Description: If a media file is used, it can be explicitly defined. Use the <<media::something>> tag to assign your media files in the script, where media (the key) means "a media file" and something is the value of the media tag. The value can be anything you'd like, and you will use it in the mastering call.
Format: <<media::something>>
Example: <<media::countdown>>

❗️

Invalid Characters in scriptText

"'[]^`{>}~|</\

SSML

SSML tags let you provide additional information about how a voice should interpret a text. An api.audio script only interprets the SSML tags that help the voice interpret the content (e.g. where breaks occur) and ignores additional parameters such as voice selection or the audio tracks to be used. You can provide the latter when issuing a production request, which offers more granular control over how the script will sound in the end. Further, different voice providers support different additional tags. The overview below shows which tags each voice provider supports.

Here is an example of SSML:

Each morning when I wake up, <prosody volume='loud' rate='x-slow'> I speak quite slowly and deliberately until I have my coffee.</prosody>

Supported SSML tags control your script content and are used in the script API call. The full list follows the tag categories below.

📘

Tag Categories

  • Intonation - In phonetics, intonation is the "melody" pattern of an utterance. It is primarily a matter of variation in the pitch level of the voice (tone), stress, and rhythm.
  • Pronunciation - Pronunciation is the phonetic transcription of a given word.
  • Structural - Structural tags are useful when structuring and formatting your script.
  • Pacing - Pacing is the rate of speech: how fast you speak, how often you use breaks, etc.

Tag: <break>
Category: pacing
Action: An empty element that controls pausing or other prosodic boundaries between words. If this element is not present between words, the break is automatically determined based on the linguistic context.
Example:
<speak>
Step 1, take a deep breath. <break time="200ms"/>
Step 2, exhale.
Step 3, take a deep breath again. <break time="500ms"/>
Step 4, exhale.
</speak>
Supported by: Google, AWS, Microsoft, IBM

Tag: <p>, <s>
Category: structural
Action: Sentence and paragraph elements.
Example:
<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
Supported by: Google, AWS, Microsoft, IBM

Tag: <speak>
Category: structural
Action: The root element of the SSML response.
Example:
<speak>
my SSML content
</speak>
Supported by: Google, AWS, Microsoft, IBM

Tag: <amazon:effect name="whispered">
Category: intonation
Action: Indicates that the input text should be spoken in a whispered voice rather than as normal speech.
Example:
<speak>
<amazon:effect name="whispered">If you make any noise, </amazon:effect>
she said, <amazon:effect name="whispered">they will hear us.</amazon:effect>
</speak>
Supported by: AWS (only supported for non-neural voices [“Conchita”, “Lucia”, “Enrique”, “Penelope”, “Miguel”])

Tag: <amazon:effect phonation="soft">
Category: intonation
Action: Speaks more softly than the normal voice.
Example:
<speak>
This is Matthew speaking in my normal voice. <amazon:effect phonation="soft">This
is Matthew speaking in my softer voice.</amazon:effect>
</speak>
Supported by: AWS (only supported for non-neural voices [“Conchita”, “Lucia”, “Enrique”, “Penelope”, “Miguel”])

Tag: <emphasis>
Category: intonation
Action: Used to add or remove emphasis from text contained by the element.
Example:
<speak>
I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>
Supported by: Google, AWS (only supported for non-neural voices [“Conchita”, “Lucia”, “Enrique”, “Penelope”, “Miguel”])

Tag: <prosody>
Category: intonation
Action: Used to customise the pitch, speaking rate, and volume of text contained by the element.
Example:
<prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
Supported by: Google, AWS, IBM

Tag: <mstts:silence>
Category: pacing
Action: Use the mstts:silence element to insert pauses before or after text, or between two adjacent sentences. Unlike break, which can be added anywhere in the text, silence only works at the beginning or end of the input text, or at the boundary of two adjacent sentences.
Example:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-AriaNeural">
<mstts:silence type="Sentenceboundary" value="200ms"/>
If we’re home schooling, the best we can do is roll with what each day brings and try to have fun along the way.
</voice>
</speak>
Supported by: Microsoft

Tag: <lang>
Category: pronunciation
Action: You can use <lang> to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice.
Example:
<speak>
<lang xml:lang="fr-FR">Je ne parle pas français.</lang>
</speak>
Supported by: Google, AWS

Tag: <phoneme>
Category: pronunciation
Action: You can use the <phoneme> tag to produce custom pronunciations of words inline. Text-to-Speech accepts the IPA and X-SAMPA phonetic alphabets.
Example:
<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
<phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>
Supported by: Google, IBM

Tag: <say-as>
Category: pronunciation
Action: Lets you indicate the type of text construct contained within the element. It also helps specify the level of detail for rendering the contained text.
Example:
<speak>
<say-as interpret-as="cardinal">12345</say-as>
</speak>
Supported by: Google, AWS, Microsoft, IBM

Tag: <sub>
Category: pronunciation
Action: Indicates that the text in the alias attribute value replaces the contained text for pronunciation.
Example:
<sub alias="World Wide Web Consortium">W3C</sub>
Supported by: Google, AWS, IBM

Tag: <w>
Category: pronunciation
Action: You can use the <w> tag to customise the pronunciation of words by specifying the word’s part of speech or alternate meaning.
Example:
<speak>
The word <say-as interpret-as="characters">read</say-as> may be interpreted
as either the present simple form <w role="amazon:VB">read</w>, or the past participle form <w role="amazon:VBD">read</w>.
</speak>
Supported by: AWS

Tag: <mark>
Category: structural
Action: An empty element that places a marker into the text or tag sequence. It can be used to reference a specific location in the sequence or to insert a marker into an output stream for asynchronous notification.
Example:
<speak>
Go from <mark name="here"/> here, to <mark name="there"/> there!
</speak>
Supported by: Google, AWS, IBM

Tag: <mstts:express-as style="">
Category: intonation
Action: You can adjust the speaking style to express different emotions like cheerfulness, empathy, and calm.
Example:
<mstts:express-as style="cheerful"></mstts:express-as>
Supported by: Microsoft
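
The provider support listed above can be encoded as a lookup table to flag tags that a chosen voice provider would not interpret. This is a minimal sketch under stated assumptions: the `SUPPORT` mapping simply mirrors the list in this section, `unsupported_tags` is a hypothetical helper rather than an api.audio function, and the non-neural voice restrictions noted for some AWS tags are not modelled.

```python
import re

ALL = {"Google", "AWS", "Microsoft", "IBM"}

# Provider support per tag, copied from the list in this section.
SUPPORT = {
    "break": ALL, "p": ALL, "s": ALL, "speak": ALL, "say-as": ALL,
    "amazon:effect": {"AWS"},
    "emphasis": {"Google", "AWS"},
    "prosody": {"Google", "AWS", "IBM"},
    "mstts:silence": {"Microsoft"},
    "lang": {"Google", "AWS"},
    "phoneme": {"Google", "IBM"},
    "sub": {"Google", "AWS", "IBM"},
    "w": {"AWS"},
    "mark": {"Google", "AWS", "IBM"},
    "mstts:express-as": {"Microsoft"},
}

# Opening SSML tags only; the lookbehind skips <<key::value>> annotations.
OPENING_TAG = re.compile(r"(?<!<)<([A-Za-z][\w:-]*)")


def unsupported_tags(script_text: str, provider: str) -> list[str]:
    """Return the tag names in the text that the provider is not listed for."""
    tags = set(OPENING_TAG.findall(script_text))
    return sorted(t for t in tags if provider not in SUPPORT.get(t, set()))


if __name__ == "__main__":
    ssml = '<speak><prosody rate="slow">Hi</prosody></speak>'
    print(unsupported_tags(ssml, "Microsoft"))
```

Tags that do not appear in the mapping at all are reported as unsupported, which errs on the side of caution.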

Personalisation Parameters

Personalisation parameters allow you to create content dynamically and address it to each individual listener.

Some examples of parameters you can choose to use in your script:

Tag                  Example
{{username}}         Hi {{username|Anna}}, welcome to your script.
{{location}}         Welcome to sunny {{location|Barcelona}}
{{exercise_name}}    Up next {{exercise_name|squats}}
{{number_of_reps}}   The next exercise consists of {{number_of_reps|15}}
{{duration}}         Keep going. Only {{duration|10}} seconds left.
{{speed}}            You are running at {{speed|4:40}} minutes per kilometre
{{bpm}}              Your heart rate is {{bpm|150}} beats per minute.
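
To illustrate how such parameters resolve, here is a minimal local sketch of the substitution. It assumes, based on the examples above, that the text after the `|` acts as a default when no value is supplied. The `render` function is hypothetical and is not how api.audio is confirmed to implement this.

```python
import re

# {{name}} or {{name|default}}
PARAM = re.compile(r"\{\{\s*(\w+)\s*(?:\|([^{}]*))?\}\}")


def render(script_text: str, values: dict[str, str]) -> str:
    """Replace each parameter with a supplied value, else its default."""

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in values:
            return values[name]
        if default is not None:
            return default
        # No value and no default: leave the placeholder untouched.
        return match.group(0)

    return PARAM.sub(substitute, script_text)


if __name__ == "__main__":
    script = "Hi {{username|Anna}}, welcome to your script."
    print(render(script, {}))
    print(render(script, {"username": "Clara"}))
```

Leaving unresolved placeholders in place, rather than raising an error, is a design choice of this sketch that makes missing values easy to spot in the output.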

Comments

The script supports two types of comments:

Tag                        Example
<<comment::default>>       <<comment:: please add some horn sounds here>>
<<sectionName::default>>   <<sectionName::Introduction>>

Here are two example scripts:

  • A simple example script can look something like this:

<<sectionName::default>> Hello world.

  • A more complicated example script can look something like this:

scriptText="<<sectionName::hello>> Hello world {{username|buddy}}"
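
The `<<key::value>>` annotations in a script can be listed with a short regular expression, which is handy for checking a script before sending it. This is a sketch only: `list_annotations` is a hypothetical helper, and the pattern assumes annotation keys contain only word characters.

```python
import re

# <<key::value>> where key is e.g. comment, sectionName, soundSegment, media
ANNOTATION = re.compile(r"<<(\w+)::(.*?)>>")


def list_annotations(script_text: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs for every annotation, in order of appearance."""
    return [(key, value.strip()) for key, value in ANNOTATION.findall(script_text)]


if __name__ == "__main__":
    script_text = "<<sectionName::hello>> Hello world {{username|buddy}}"
    print(list_annotations(script_text))
```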

Script Sections

A script can be divided into script sections. Script sections help you organise and manage your content, and they also let you make your content more dynamic by changing parameter settings within a script (such as speakers, effects, and speed). A section is a comment: it is included in the script but ignored by processing.

Script sections are useful, and must be defined, when using a sound design and/or when creating audio with multiple speakers.

A sample script with sound segments looks like this:

<<soundSegment::intro>>
<<sectionName::intro>>
This is the first part of speech with a sound in the background
<<soundSegment::main>>
<<sectionName::main>>
This is the main part of my script. Also with a backing track
<<soundSegment::outro>>
<<sectionName::outro>>
And this is the end. 
<<soundEffect::effect1>>
<<sectionName::effect>>

Each script section can be assigned a different speaker by defining sections in the speech call. See the documentation on multiple speakers in one script for details.

A sample script with two sections for the purpose of using different speakers could look like this:

scriptText="<<sectionName::greeting>> Hello world. <<sectionName::goodbye>> It was so nice to meet you."

❗️

The length of a script section is limited to 4000 characters.
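
The section limit can be checked locally before submitting a script. The sketch below splits a script on its `sectionName` annotations and reports sections that exceed the limit. The helper names are hypothetical, and it is an assumption that the limit applies to the section text with other annotations removed. The source does not say how the 4000 characters are counted.

```python
import re

SECTION = re.compile(r"<<sectionName::(.*?)>>")
OTHER_ANNOTATION = re.compile(r"<<\w+::.*?>>")
MAX_SECTION_LENGTH = 4000  # limit stated in this document


def split_sections(script_text: str) -> dict[str, str]:
    """Map each section name to its text (other annotations removed)."""
    # re.split with one capture group yields:
    # [text before first section, name1, text1, name2, text2, ...]
    parts = SECTION.split(script_text)
    sections = {}
    for name, text in zip(parts[1::2], parts[2::2]):
        sections[name.strip()] = OTHER_ANNOTATION.sub("", text).strip()
    return sections


def oversized_sections(script_text: str) -> list[str]:
    """Return the names of sections longer than MAX_SECTION_LENGTH."""
    return [
        name
        for name, text in split_sections(script_text).items()
        if len(text) > MAX_SECTION_LENGTH
    ]


if __name__ == "__main__":
    script = (
        "<<sectionName::greeting>> Hello world. "
        "<<sectionName::goodbye>> It was so nice to meet you."
    )
    print(split_sections(script))
    print(oversized_sections(script))
```

Sections that share a name overwrite each other in this sketch, so it assumes section names are unique within a script.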

Sound Design Annotation

Sound Design annotations allow you to create dynamic background tracks and place sound effects throughout your script. You can pick a sound segment to play in the background of a speech section, add a sound effect without speech, or play a speech section without any sound.

To pair a sound segment with a speech section, simply add the sound segment annotation before the speech section annotation:

<<soundSegment::intro>>
<<sectionName::intro>>

To add a sound effect that will play without any speech, add the sound effect annotation:

<<soundEffect::effect1>>
<<sectionName::effect>>

To have a speech section play without any music or sound in the background, define the speech section without a sound segment annotation before it:

<<sectionName::hello>>
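
The pairing rule described in this section (a sound annotation applies to the speech section that directly follows it) can be sketched as a small parser. This is an illustration only: `pair_sound_with_sections` is a hypothetical helper, and the way api.audio resolves these annotations internally is not confirmed by this document.

```python
import re
from typing import Optional

ANNOTATION = re.compile(r"<<(soundSegment|soundEffect|sectionName)::(.*?)>>")


def pair_sound_with_sections(
    script_text: str,
) -> list[tuple[str, Optional[tuple[str, str]]]]:
    """Return (section name, (sound kind, sound name) or None) per section."""
    pairs = []
    pending = None  # latest sound annotation not yet attached to a section
    for kind, name in ANNOTATION.findall(script_text):
        if kind == "sectionName":
            pairs.append((name, pending))
            pending = None
        else:
            pending = (kind, name)
    return pairs


if __name__ == "__main__":
    script = """
    <<soundSegment::intro>>
    <<sectionName::intro>>
    This is the first part of speech with a sound in the background
    <<sectionName::hello>>
    This part plays without any background sound
    <<soundEffect::effect1>>
    <<sectionName::effect>>
    """
    for section, sound in pair_sound_with_sections(script):
        print(section, sound)
```

A section paired with `None` plays without background sound, matching the last case described above.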