Customize synthesized voice talent with Amazon Polly.


Read this article about how to synthesize or mimic actual human voices:

We’re not using this technique yet, but we’re keeping an eye on it. Soon we may be able to use Hollywood voice-over talent for free.

Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. With SSML tags, you can customize and control aspects of speech such as pronunciation, volume, and speech rate.

Voices in Amazon Polly

Amazon Polly provides a variety of different voices in multiple languages for use when synthesizing speech from text.

Available Voices

Language                        Female Names/ID                         Male Names/ID
Arabic (arb)                    Zeina
Chinese, Mandarin (cmn-CN)      Zhiyu
Danish (da-DK)                  Naja                                    Mads
Dutch (nl-NL)                   Lotte                                   Ruben
English, Australian (en-AU)     Nicole                                  Russell
English, British (en-GB)        Amy
English, Indian (en-IN)         Aditi (bilingual with Hindi)
English, US (en-US)             Ivy
English, Welsh (en-GB-WLS)                                              Geraint
French (fr-FR)                  Céline/Celine
French, Canadian (fr-CA)        Chantal
German (de-DE)                  Marlene
Hindi (hi-IN)                   Aditi (bilingual with Indian English)
Icelandic (is-IS)               Dóra/Dora                               Karl
Italian (it-IT)                 Carla
Japanese (ja-JP)                Mizuki                                  Takumi
Korean (ko-KR)                  Seoyeon
Norwegian (nb-NO)               Liv
Polish (pl-PL)                  Ewa
Portuguese, Brazilian (pt-BR)   Vitória/Vitoria                         Ricardo
Portuguese, European (pt-PT)    Inês/Ines                               Cristiano
Romanian (ro-RO)                Carmen
Russian (ru-RU)                 Tatyana                                 Maxim
Spanish, European (es-ES)       Conchita
Spanish, Mexican (es-MX)        Mia
Spanish, US (es-US)             Penélope/Penelope                       Miguel
Swedish (sv-SE)                 Astrid
Turkish (tr-TR)                 Filiz
Welsh (cy-GB)                   Gwyneth

To ensure continuous support for customers, we don’t plan to retire any voices. This applies to both currently available and future voices.

Supported SSML Tags

Amazon Polly supports the following SSML tags:

Unsupported SSML tags in input text generate errors.

Timbre describes the perceived color or quality of a sound, independently from pitch or loudness. Timbre is what differentiates one voice from another, even when their pitch and loudness are the same.

An important physiological feature that contributes towards speech timbre is the vocal tract, which is a cavity of air that spans from the top of the vocal folds up to the edge of the lips. A variety of muscles make it possible to change the shape of the vocal tract cavity by making it longer, shorter, wider, or narrower. These changes cause the resulting speech sounds to be amplified or filtered out.

Trained impersonators learn to control these movements to such a degree that they are even able to alter their voices to make themselves sound like somebody else.

Pitch is what makes a voice sound higher or lower. Women generally have shorter vocal folds that vibrate more frequently (~180-200 cycles per second). Men have, on average, longer vocal folds that vibrate more slowly (~110 cycles per second). Similarly, the average vocal tract length is shorter for women than it is for men (~14cm vs ~17cm).

There is a natural correlation between vocal fold length and vocal tract length, such that when one increases the other tends to increase as well. The Timbre feature allows developers to change the size of the vocal tract while retaining the ability to control pitch.

Vocal tract and speech synthesis

When you increase the vocal-tract-length, the speaker will sound like they’re bigger. When you decrease it, they will sound smaller.

Here’s how you can modify the length of the speaker’s vocal tract:

+n% or -n%: adjusts the vocal tract length by a relative percentage change in the current voice. For example, +4% or -2%.

n%: adjusts the vocal tract length to an absolute percentage value of the current voice. For example, 104% or 98%.

Vocal tract length can be increased up to +100%, and down to -50%.

To reset the vocal tract length to the default value for the current voice, use <amazon:effect vocal-tract-length="100%">.

The following example shows how the vocal tract length can be modified, using Joanna’s voice:

This is my original voice, without any modifications.
<amazon:effect vocal-tract-length="+15%">Now, imagine that I am much bigger.</amazon:effect>
<amazon:effect vocal-tract-length="-15%">Or, perhaps you prefer my voice when I'm very small?</amazon:effect>
You can also control the timbre of my voice by making more minor adjustments.
<amazon:effect vocal-tract-length="+10%">For example, by making me sound just a little bigger.</amazon:effect>
<amazon:effect vocal-tract-length="-10%">Or instead, making me sound only somewhat smaller.</amazon:effect>
changing voice sound
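
The limits described above (up to +100%, down to -50%) are easy to enforce when the markup is generated by a script. The following is a minimal sketch, not part of any Amazon SDK; the function name vocal_tract_effect is hypothetical, and it assumes you want out-of-range values clamped rather than rejected.

```python
from xml.sax.saxutils import escape


def vocal_tract_effect(text: str, change_percent: float) -> str:
    """Wrap text in an amazon:effect tag with a relative vocal tract change.

    Hypothetical helper: clamps the change to the documented range
    of -50% to +100% before building the tag.
    """
    clamped = max(-50.0, min(100.0, change_percent))
    # "+g" always prints the sign and drops a trailing ".0" (15.0 -> "+15").
    value = f"{clamped:+g}%"
    # Escape &, < and > so the spoken text cannot break the markup.
    return (
        f'<amazon:effect vocal-tract-length="{value}">'
        f"{escape(text)}</amazon:effect>"
    )


print(vocal_tract_effect("Now, imagine that I am much bigger.", 15))
```

Clamping keeps a typo such as 250 from producing an out-of-range request; raising an error instead would be an equally reasonable choice.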


Combining multiple tags

You can combine the vocal-tract-length SSML tag with any other SSML tag that is supported by Amazon Polly. Since vocal tract length and pitch are closely connected in nature, you might get the best results by changing the vocal tract length together with the pitch (by applying the <prosody pitch> tag).

The pitch and timbre of a person’s voice are connected in human speech. If you are going to reduce the vocal tract length, you might consider increasing the pitch as well. If instead you choose to lengthen the vocal tract, you might also want to lower the pitch.

Samples range from very lifelike speech, to more character-like speech.

Vocal-Tract-Length and Pitch Sample Matrix (.ppt)

Modify the Timbre of Amazon Polly Voices with the New Vocal Tract SSML Feature

Break/Pause Tag

To make Alexa pause while she talks, you can add an SSML tag into the middle of your text. The SSML format for breaks is just one tag, and follows this format:

<break time="3s"/> 
Attribute: Break

The amount of time can either be in seconds (s), or milliseconds (ms). Remember to add the forward slash after the pause length, otherwise your tag won’t work!

Some skills read plenty of text, and begin to sound mechanical if there are no natural pauses. Using breaks between paragraphs can solve that, and the <p></p> markup provides a brief pause that can easily be controlled when scripting the request. There are other times where a longer break is necessary that can be solved by setting a timed break (code below). This can be for up to ten seconds.

Polly puts a pause of ~0.4 seconds between sentences and ~0.7 seconds between paragraphs. Pauses of the same length can also be set explicitly in code.
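
When breaks are generated by a script, a small helper can pick the right unit and respect the ten-second ceiling mentioned above. This is only a sketch under those assumptions; break_tag is a hypothetical name, not an Alexa or Polly API.

```python
def break_tag(seconds: float) -> str:
    """Build a self-closing SSML break tag for the given pause length.

    Hypothetical helper: caps the pause at 10 seconds and uses whole
    seconds (s) when possible, otherwise milliseconds (ms).
    """
    if seconds < 0:
        raise ValueError("pause length cannot be negative")
    capped = min(seconds, 10.0)
    millis = round(capped * 1000)
    if millis % 1000 == 0:
        return f'<break time="{millis // 1000}s"/>'
    return f'<break time="{millis}ms"/>'


# A sentence-sized pause and a paragraph-sized pause, using the
# approximate defaults quoted above.
print(break_tag(0.4))
print(break_tag(0.7))
```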

    I already told you I 
    <emphasis level="strong">really like</emphasis> 
    that person.
Attribute: level (emphasis)

strong: Increase the volume and slow down the speaking rate so the speech is louder and slower.
moderate: Increase the volume and slow down the speaking rate, but not as much as when set to strong. This is used as a default if level is not provided.
reduced: Decrease the volume and speed up the speaking rate. The speech is softer and faster.


The <p> tag represents a paragraph. This tag provides extra-strong breaks before and after the tag. This is equivalent to specifying a pause with break strength="x-strong".

<p>This is the first paragraph. There should be a pause after this text is spoken.</p>
<p>This is the second paragraph.</p>


The <phoneme> tag provides a phonemic/phonetic pronunciation for the contained text. For example, people may pronounce words like “pecan” differently.

When using this tag, Alexa uses the pronunciation provided in the ph attribute rather than the text contained within the tag. However, you should still provide human-readable text within the tags. In the following example, the word “pecan” shown within the tags is never spoken. Instead, Alexa speaks the text provided in the ph attribute:

    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. 
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
alphabet, ph



The <prosody> tag modifies the volume, pitch, and rate of the tagged speech.


Modify the rate of the speech with the rate attribute:

x-slow, slow, medium, fast, x-fast: Set the rate to a predefined value.
n%: specify a percentage to increase or decrease the speed of the speech:
100% indicates no change from the normal rate.
Percentages greater than 100% increase the rate.
Percentages below 100% decrease the rate.
The minimum value you can provide is 20%.


Raise or lower the tone (pitch) of the speech with the pitch attribute:

x-low, low, medium, high, x-high: Set the pitch to a predefined value.
+n%: Increase the pitch by the specified percentage. For example: +10%, +5%. The maximum value allowed is +50%. A value higher than +50% is rendered as +50%.
-n%: Decrease the pitch by the specified percentage. For example: -10%, -20%. The smallest value allowed is -33.3%. A value lower than -33.3% is rendered as -33.3%.


Change the volume for the speech with the volume attribute:

silent, x-soft, soft, medium, loud, x-loud: Set volume to a predefined value for current voice.
+ndB: Increase volume relative to the current volume level. For example, +0dB means no change of volume. +6dB is approximately twice the current amplitude. The maximum positive value is about +4.08dB.
-ndB: Decrease the volume relative to the current volume level. For example, -6dB means approximately half the current amplitude.

    Normal volume for the first sentence.
    <prosody volume="x-loud">Louder volume for the second sentence</prosody>.
    When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.
    I can speak with my normal pitch, 
    <prosody pitch="x-high"> but also with a much higher pitch </prosody>, 
    and also <prosody pitch="low">with a lower pitch</prosody>.
Prosody: rate, pitch, volume
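
The decibel figures above follow from the standard amplitude relation, ratio = 10^(dB/20), which is why +6dB is described as roughly doubling the amplitude and -6dB as roughly halving it. The sketch below works that arithmetic and also clamps a pitch change to the +50% / -33.3% limits listed above. Both function names are hypothetical helpers for illustration.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a relative volume change in dB to an amplitude ratio."""
    return 10 ** (db / 20)


def clamp_pitch_percent(percent: float) -> float:
    """Clamp a relative pitch change to the documented -33.3%..+50% range."""
    return max(-33.3, min(50.0, percent))


for db in (6, -6, 4.08):
    # Shows how each dB step scales the amplitude.
    print(f"{db:+}dB -> x{db_to_amplitude_ratio(db):.2f} amplitude")

print(clamp_pitch_percent(80))   # a request above the limit is held at the limit
```

By the same formula, the maximum of about +4.08dB corresponds to roughly 1.6 times the current amplitude.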


The <s> tag represents a sentence. This tag provides strong breaks before and after the tag.

This is equivalent to:

Ending a sentence with a period (.).
Specifying a pause with break strength="strong".

    <s>This is a sentence</s>
    <s>There should be a short pause before this second sentence</s> 
    This sentence ends with a period and should have the same pause.


The <say-as> tag describes how the text should be interpreted. This lets you provide additional context to the text and eliminate any ambiguity on how Alexa should render the text. Indicate how Alexa should interpret the text with the interpret-as attribute.

characters, spell-out: Spell out each letter.
cardinal, number: Interpret the value as a cardinal number.
ordinal: Interpret the value as an ordinal number.
digits: Spell each digit separately.
fraction: Interpret the value as a fraction. This works for both common fractions (such as 3/20) and mixed fractions (such as 1+1/2).
unit: Interpret a value as a measurement. The value should be either a number or fraction followed by a unit (with no space in between) or just a unit.
date: Interpret the value as a date. Specify the format with the format attribute.
time: Interpret a value such as 1’21” as duration in minutes and seconds.
telephone: Interpret a value as a 7-digit or 10-digit telephone number. This can also handle extensions (for example, 2025551212x345).
address: Interpret a value as part of street address.
interjection: Interpret the value as an interjection. Alexa speaks the text in a more expressive voice. For optimal results, only use the supported interjections and surround each speechcon with a pause. For example: <say-as interpret-as="interjection">Wow.</say-as>. Speechcons are supported for the languages listed below.
expletive: “Bleep” out the content inside the tag.

The format attribute is only used when interpret-as is set to date. Set it to one of the following to indicate the format of the date:


Alternatively, if you provide the date in YYYYMMDD format, the format attribute is ignored. You can include question marks (?) for portions of the date to leave out. For instance, Alexa would speak <say-as interpret-as="date">????0922</say-as> as “September 22nd”.

Note that the Alexa service attempts to interpret the provided text correctly based on the text’s formatting even without this tag. For example, if your output speech includes “202-555-1212”, Alexa speaks each individual digit, with a brief pause for each dash. You don’t need to use say-as interpret-as="telephone" in this case. However, if you provided the text “2025551212”, but you wanted Alexa to speak it as a phone number, you would need to use say-as interpret-as="telephone".

    Here is a number spoken as a cardinal number: 
    <say-as interpret-as="cardinal">12345</say-as>.
    Here is the same number with each digit spoken separately:
    <say-as interpret-as="digits">12345</say-as>.
    Here is a word spelled out: <say-as interpret-as="spell-out">hello</say-as>
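
If the text to be spoken comes from user data, it helps to build say-as tags programmatically so that special characters are escaped. The following is a rough sketch using only the Python standard library; say_as is a hypothetical helper name, and it simply passes through whatever interpret-as value you give it without checking it against the list above.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in a say-as tag with the given interpret-as value.

    Hypothetical helper: escapes the text and quotes the attribute so
    characters such as & or < do not produce invalid SSML.
    """
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


# An unformatted phone number needs the explicit hint, as noted above.
print(say_as("2025551212", "telephone"))
print(say_as("12345", "digits"))
```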


To include a speechcon in your skill’s text-to-speech response, use the <say-as interpret-as="interjection"> SSML tag:

    Here is an example of a speechcon. 
    <say-as interpret-as="interjection">abracadabra!</say-as>.

Be sure to surround each speechcon with a pause. You can use punctuation (such as a period or comma) or other SSML tags (for instance, <break> or <s>) for pauses.

Speechcon Names – 188

  1. abracadabra
  2. achoo
  3. aha
  4. ahem
  5. ahoy
  6. all righty
  7. aloha
  8. aooga
  9. argh
  10. arrivederci
  11. as you wish
  12. au revoir
  13. aw man
  14. baa
  15. bada bing bada boom
  16. bah humbug
  17. bam
  18. bang
  19. batter up
  20. bazinga
  21. beep beep
  22. bingo
  23. blah
  24. blarg
  25. blast
  26. boing
  27. bon appetit
  28. bonjour
  29. bon voyage
  30. boo
  31. boo hoo
  32. boom
  33. booya
  34. bravo
  35. bummer
  36. caw
  37. cha ching
  38. checkmate
  39. cheerio
  40. cheers
  41. cheer up
  42. chirp
  43. choo choo
  44. clank
  45. click clack
  46. cock a doodle doo
  47. coo
  48. cowabunga
  49. darn
  50. ding dong
  51. ditto
  52. d’oh
  53. dot dot dot
  54. duh
  55. dum
  56. dun dun dun
  57. dynomite
  58. eek
  59. eep
  60. encore
  61. en gard
  62. eureka
  63. fancy that
  64. geronimo
  65. giddy up
  66. good grief
  67. good luck
  68. good riddance
  69. gotcha
  70. great scott
  71. heads up
  72. hear hear
  73. hip hip hooray
  74. hiss
  75. honk
  76. howdy
  77. hurrah
  78. hurray
  79. huzzah
  80. jeepers creepers
  81. jiminy cricket
  82. jinx
  83. just kidding
  84. kaboom
  85. kablam
  86. kaching
  87. kapow
  88. katchow
  89. kazaam
  90. kerbam
  91. kerboom
  92. kerching
  93. kerchoo
  94. kerflop
  95. kerplop
  96. kerplunk
  97. kerpow
  98. kersplat
  99. kerthump
  100. knock knock
  101. le sigh
  102. look out
  103. mamma mia
  104. man overboard
  105. mazel tov
  106. meow
  107. merci
  108. moo
  109. nanu nanu
  110. neener neener
  111. no way
  112. now now
  113. oh boy
  114. oh brother
  115. oh dear
  116. oh my
  117. oh snap
  118. oink
  119. okey dokey
  120. oof
  121. ooh la la
  122. open sesame
  123. ouch
  124. oy
  125. phew
  126. phooey
  127. ping
  128. plop
  129. poof
  130. pop
  131. pow
  132. quack
  133. read ’em and weep
  134. ribbit
  135. righto
  136. roger
  137. ruh roh
  138. shucks
  139. splash
  140. spoiler alert
  141. squee
  142. swish
  143. swoosh
  144. ta da
  145. ta ta
  146. tee hee
  147. there there
  148. thump
  149. tick tick tick
  150. tick-tock
  151. touche
  152. tsk tsk
  153. tweet
  154. uh huh
  155. uh oh
  156. voila
  157. vroom
  158. wahoo
  159. wah wah
  160. watch out
  161. way to go
  162. well done
  163. well well
  164. wham
  165. whammo
  166. whee
  167. whew
  168. woof
  169. whoops a daisy
  170. whoosh
  171. woo hoo
  172. wow
  173. wowza
  174. wowzer
  175. yadda yadda yadda
  176. yay
  177. yikes
  178. yippee
  179. yoink
  180. yoo hoo
  181. you bet
  182. yowza
  183. yowzer
  184. yuck
  185. yum
  186. zap
  187. zing
  188. zoinks

Polly supports SSML tags with two extensions: breaths and voice effects.

The self-closing amazon:breath tag instructs the artificial speaker to take a (fairly life-like) breath of a specified length and volume.

Voice effects include whispering, speaking softly and changing the vocal tract length to make the speaker sound bigger or smaller.

<speak version="1.1" xml:lang="en-US">
<amazon:effect vocal-tract-length="+20%">
No, <amazon:breath duration="long" volume="loud" />
<break />
<break />
your father!
<amazon:breath duration="x-long" volume="x-loud" />
</amazon:effect>
</speak>
Breaths and voice effects for Polly

SSML specifies a fair amount of markup for prosody. This includes markup for:

  • pitch
  • contour
  • pitch range
  • rate
  • duration
  • volume

It will no longer be good enough to get something published. Differentiation will come from standout quality.

SSML enables complete control over audio rendering. If you’re familiar with CSS for browser apps, SSML is the equivalent for audio apps. Imagine how poor the user experience for our favorite websites would be if they only rendered basic text. It’s the same for voice, which is why it is so important to learn SSML to further the audio experience.

Prosody concerns the elements of speech that are not individual vowels and consonants but are properties of syllables and larger units of speech. These include intonation, tone, stress, and rhythm.

Prosody reflects various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus. It may otherwise reflect other elements of language that may not be encoded by grammar or by choice of vocabulary.

When studying the prosodic aspects of speech, it is usual to distinguish between auditory measures (subjective impressions produced in the mind of the listener) and acoustic measures (physical properties of the sound wave that may be measured objectively). Auditory and acoustic measures of prosody do not correspond in a linear way.

In auditory terms, the major variables are:

  • the pitch of the voice (varying between low and high)
  • length of sounds (varying between short and long)
  • loudness, or prominence (varying between soft and loud)
  • timbre (quality of sound)

In acoustic terms, these correspond reasonably closely to:

  • fundamental frequency (measured in hertz, or cycles per second)
  • duration (measured in time units such as milliseconds or seconds)
  • intensity, or sound pressure level (measured in decibels)
  • spectral characteristics (distribution of energy at different parts of the audible frequency range)

Different combinations of these variables are exploited in intonation and stress, as well as rhythm, tempo and loudness. Additional prosodic variables include voice quality and pausing.


Prosodic features are said to be suprasegmental, since they are properties of units of speech larger than the individual segment (though exceptionally it may happen that a single segment may constitute a syllable, and thus even a whole utterance, e.g. “Ah!”). It is necessary to distinguish between the personal, background characteristics that belong to an individual’s voice (for example their habitual pitch range) and the independently variable prosodic features that are used contrastively to communicate meaning (for example, the use of changes in pitch to indicate the difference between statements and questions).


English intonation is based on three aspects:

  • The division of speech into units
  • The highlighting of particular words and syllables
  • The choice of pitch movement (e.g. fall or rise)

These are sometimes known as Tonality, Tonicity and Tone.

Speakers sometimes use a wide pitch range (usually associated with excitement) and at other times a narrow range. English has been said to make use of changes in key: shifting one’s intonation into the higher or lower part of one’s pitch range is believed to be meaningful in certain contexts.


Stress makes a syllable prominent. Stress may be studied in relation to individual words (named “word stress” or lexical stress) or in relation to larger units of speech (traditionally referred to as “sentence stress” but more appropriately named “prosodic stress”). Stressed syllables are made prominent by several variables, by themselves or in combination.

Stress is associated with the following:

  • pitch prominence, that is, a pitch level that is different from that of neighbouring syllables, or a pitch movement.
  • increased length (duration).
  • increased loudness (dynamics).
  • differences in timbre: in English and some other languages, stress is associated with aspects of vowel quality (whose acoustic correlate is the formant frequencies or spectrum of the vowel). Unstressed vowels tend to be centralized relative to stressed vowels, which are normally more peripheral in quality.

These cues to stress are not equally powerful. Pitch, length and loudness form a scale of importance in bringing syllables into prominence, with pitch being the most efficacious and loudness the least.

When pitch prominence is the major factor, the resulting prominence is often called accent rather than stress.

There is considerable variation from language to language concerning the role of stress in identifying words or in interpreting grammar and syntax.


Speech tempo is a measure of the number of speech units of a given type produced within a given amount of time. Speech tempo is believed to vary within the speech of one person according to contextual and emotional factors, between speakers and also between different languages and dialects. However, there are many problems involved in investigating this variance scientifically.

Measurements of speech tempo can be strongly affected by pauses and hesitations. For this reason, it is usual to distinguish between speech tempo including pauses and hesitations and speech tempo excluding them. The former is called speaking rate and the latter articulation rate.
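
The difference between the two measures is simple arithmetic: speaking rate divides the number of units by the total time, while articulation rate divides by the time left after pauses are removed. Here is a small worked sketch; the function names and the sample figures (30 syllables, 10 seconds, 2.5 seconds of pauses) are made up for illustration.

```python
def speaking_rate(units: int, total_seconds: float) -> float:
    """Units per second, with pauses and hesitations included."""
    return units / total_seconds


def articulation_rate(units: int, total_seconds: float, pause_seconds: float) -> float:
    """Units per second, with pauses and hesitations excluded."""
    speaking_time = total_seconds - pause_seconds
    if speaking_time <= 0:
        raise ValueError("pause time must be shorter than total time")
    return units / speaking_time


# Illustrative numbers only: 30 syllables in 10 s, of which 2.5 s is pausing.
print(speaking_rate(30, 10.0))           # 3.0 syllables per second
print(articulation_rate(30, 10.0, 2.5))  # 4.0 syllables per second
```

The articulation rate is always at least as high as the speaking rate, since the same number of units is divided by a shorter time.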

One measure is sounds per second. Reported rates vary from an average of 9.4 sounds per second for poetry reading to 13.83 per second for sports commentary.

Monosyllables may be pronounced as “clipped”, “drawled” or “held” and polysyllabic utterances may be spoken at “allegro”, “allegrissimo”, “lento” and “lentissimo”.

The widespread view that some languages are spoken more rapidly than others is an illusion. This illusion is related to differences of rhythm and pausing.


Although rhythm is not a prosodic variable in the way that pitch or loudness are, it is usual to treat a language’s characteristic rhythm as a part of its prosodic phonology. It has often been asserted that languages exhibit regularity in the timing of successive units of speech, a regularity referred to as isochrony, and that every language may be assigned one of three rhythmical types: stress-timed (where the durations of the intervals between stressed syllables are relatively constant), syllable-timed (where the durations of successive syllables are relatively constant) and mora-timed (where the durations of successive morae are relatively constant).


Voiced or unvoiced, the pause is a form of interruption to articulatory continuity. Conversation analysis commonly notes pause length. Distinguishing auditory hesitation from silent pauses is one challenge. Contrasting junctures within and without word chunks can aid in identifying pauses.

There are a variety of “filled” pause types. Formulaic language pause fillers include “Like”, “Er” and “Uhm”, and paralinguistic expressive respiratory pauses include the sigh and gasp.

Although related to breathing, pauses may contain contrastive linguistic content, as in the periods between individual words in English advertising voice-over copy sometimes placed to denote high information content, e.g. “Quality. Service. Value.”


Pausing or its lack contributes to the perception of word groups, or chunks. Chunks commonly highlight lexical items or fixed expression idioms. The well-known English chunk “Know what I mean?” sounds like a single word (“No-whuta-meen?”) due to blurring or rushing the articulation of adjacent word syllables, thereby changing the potential open junctures between words into closed junctures.

Cognitive aspects

Intonation is said to have a number of perceptually significant functions in English and other languages, contributing to the recognition and comprehension of speech.


The sentence “They invited Bob and Bill and Al got rejected” is ambiguous when written, although addition of a written comma after either “Bob” or “Bill” will remove the sentence’s ambiguity. But when the sentence is read aloud, prosodic cues like pauses (dividing the sentence into chunks) and changes in intonation will reduce or remove the ambiguity. Moving the intonational boundary in cases such as the above example will tend to change the interpretation of the sentence.


Intonation and stress work together to highlight important words or syllables for contrast and focus. A well-known example is the ambiguous sentence “I never said she stole my money”, where there are seven meaning changes depending on which of the seven words is vocally highlighted.
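
To hear the seven readings with a synthetic voice, one option is to generate one SSML document per word, each placing a strong emphasis tag on a different word. This is only a sketch of that idea; emphasis_variants is a hypothetical helper, and how distinctly a given voice renders each variant is something to check by ear.

```python
def emphasis_variants(sentence: str) -> list[str]:
    """Return one SSML string per word, each emphasizing a different word."""
    words = sentence.split()
    variants = []
    for target in range(len(words)):
        marked = [
            f'<emphasis level="strong">{word}</emphasis>' if i == target else word
            for i, word in enumerate(words)
        ]
        variants.append("<speak>" + " ".join(marked) + "</speak>")
    return variants


for ssml in emphasis_variants("I never said she stole my money"):
    print(ssml)
```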


Prosody plays a role in the regulation of conversational interaction and in signaling discourse structure indicating whether information is new or already established; whether a speaker is dominant or not in a conversation; and when a speaker is inviting the listener to make a contribution to the conversation.


Prosody is also important in signalling emotions and attitudes. When this is involuntary (as when the voice is affected by anxiety or fear), the prosodic information is not linguistically significant. However, when the speaker varies her speech intentionally, for example to indicate sarcasm, this usually involves the use of prosodic features. The most useful prosodic feature in detecting sarcasm is a reduction in the mean fundamental frequency relative to other speech for humor, neutrality, or sincerity. While prosodic cues are important in indicating sarcasm, context clues and shared knowledge are also important.

Native speakers listening to actors reading emotionally neutral text while projecting emotions correctly recognized happiness 62% of the time, anger 95%, surprise 91%, sadness 81%, and neutral tone 76%. When a database of this speech was processed by computer, segmental features allowed better than 90% recognition of happiness and anger, while suprasegmental prosodic features allowed only 44%–49% recognition. The reverse was true for surprise, which was recognized only 69% of the time by segmental features and 96% of the time by suprasegmental prosody.

In typical conversation (no actor voice involved), the recognition of emotion may be quite low, of the order of 50%, hampering the complex interrelationship function of speech advocated by some authors. However, even if emotional expression through prosody cannot always be consciously recognized, tone of voice may continue to have subconscious effects in conversation. This sort of expression does not stem from linguistic or semantic effects, and can thus be isolated from traditional linguistic content. The ability of the average person to decode the conversational implicature of emotional prosody has been found to be slightly less accurate than traditional facial expression discrimination ability; however, the specific ability to decode varies by emotion. These emotional expressions have been determined to be ubiquitous, as they are utilized and understood across cultures. Various emotions, and their general experimental identification rates, are as follows:

  • Anger and sadness: High rate of accurate identification
  • Fear and happiness: Medium rate of accurate identification
  • Disgust: Poor rate of accurate identification

The prosody of an utterance is used by listeners to guide decisions about the emotional affect of the situation. Whether a person decodes the prosody as positive, negative, or neutral plays a role in the way a person decodes a facial expression accompanying an utterance. As the facial expression becomes closer to neutral, the prosodic interpretation influences the interpretation of the facial expression.

Child language

Unique prosodic features have been noted in infant-directed speech (IDS) – also known as baby talk, child-directed speech (CDS), or motherese. Adults, especially caregivers, speaking to young children tend to imitate childlike speech by using higher and more variable pitch, as well as an exaggerated stress. These prosodic characteristics are thought to assist children in acquiring phonemes, segmenting words, and recognizing phrasal boundaries. And though there is no evidence to indicate that infant-directed speech is necessary for language acquisition, these specific prosodic features have been observed in many different languages.

IBM – Cloud

Express-as Attributes – GoodNews, Apology, or Uncertainty.

By default, the IBM Text to Speech service synthesizes text in a neutral declarative style. The service extends SSML with an <express-as> element that produces expressiveness by converting text to synthesized speech in various speaking styles. The element is analogous to the SSML element <say-as>, which specifies text normalization for formatted text such as dates, times, and numbers.

GoodNews expresses a positive, upbeat message.
Apology expresses a message of regret.
Uncertainty conveys an uncertain, interrogative message.

  "text": "<speak>
    I have been assigned to handle your order status request.
    <express-as type=\"Apology\">
      I am sorry to inform you that the items you requested are backordered.
      We apologize for the inconvenience.
    </express-as>
    <express-as type=\"Uncertainty\">
      We don&apos;t know when the items will become available. Maybe next week,
      but we are not sure at this time.
    </express-as>
    <express-as type=\"GoodNews\">
      But because we want you to be a satisfied customer, we are giving you
      a 50% discount on your order!
    </express-as>
  </speak>"
Expressions IBM
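
The backslash-escaped quotes in the example above come from embedding SSML inside a JSON string. Rather than escaping by hand, a script can let a JSON library do it. The sketch below uses only Python's standard json module; synthesis_body is a hypothetical helper name, and the "text" key is taken from the example above.

```python
import json


def synthesis_body(ssml: str) -> str:
    """Serialize SSML into a JSON body with a single "text" field.

    json.dumps escapes the double quotes inside the SSML, producing
    the same \\"Apology\\" style seen in the hand-written example.
    """
    return json.dumps({"text": ssml})


ssml = (
    "<speak>"
    '<express-as type="Apology">We apologize for the inconvenience.</express-as>'
    "</speak>"
)
print(synthesis_body(ssml))
```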