READ this article about how to synthesize or mimic actual human voices: https://google.github.io/tacotron/publications/speaker_adaptation
We’re not using this technique yet. But we’re keeping an eye on it. Soon we can use Hollywood voice-over talent – for free.
Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. With SSML tags, you can customize and control aspects of speech such as pronunciation, volume, and speech rate.
Voices in Amazon Polly
Amazon Polly provides a variety of different voices in multiple languages for use when synthesizing speech from text.
| Language | Female Names/ID | Male Names/ID |
| --- | --- | --- |
| Chinese, Mandarin (cmn-CN) | Zhiyu | |
| English, Australian (en-AU) | Nicole | Russell |
| English, British (en-GB) | Amy | |
| English, Indian (en-IN) | Aditi (bilingual with Hindi) | |
| English, US (en-US) | Ivy | |
| English, Welsh (en-GB-WLS) | | Geraint |
| French, Canadian (fr-CA) | Chantal | |
| Hindi (hi-IN) | Aditi (bilingual with Indian English) | |
| Portuguese, Brazilian (pt-BR) | Vitória/Vitoria | Ricardo |
| Portuguese, European (pt-PT) | Inês/Ines | Cristiano |
| Spanish, European (es-ES) | Conchita | |
| Spanish, Mexican (es-MX) | Mia | |
| Spanish, US (es-US) | Penélope/Penelope | Miguel |
To ensure continuous support for customers, we don’t plan to retire any voices. This applies to both currently available and future voices.
Supported SSML Tags
Amazon Polly supports the following SSML tags:
| Action | SSML Tag |
| --- | --- |
| Adding a Pause | `<break>` |
| Specifying Another Language for Specific Words | `<lang>` |
| Placing a Custom Tag in Your Text | `<mark>` |
| Adding a Pause Between Paragraphs | `<p>` |
| Using Phonetic Pronunciation | `<phoneme>` |
| Controlling Volume, Speaking Rate, and Pitch | `<prosody>` |
| Setting a Maximum Duration for Synthesized Speech | `<prosody amazon:max-duration>` |
| Adding a Pause Between Sentences | `<s>` |
| Controlling How Special Types of Words Are Spoken | `<say-as>` |
| Identifying SSML-Enhanced Text | `<speak>` |
| Pronouncing Acronyms and Abbreviations | `<sub>` |
| Improving Pronunciation by Specifying Parts of Speech | `<w>` |
| Adding the Sound of Breathing | `<amazon:auto-breaths>` |
| Adding Dynamic Range Compression | `<amazon:effect name="drc">` |
| Speaking Softly | `<amazon:effect phonation="soft">` |
| Controlling Timbre | `<amazon:effect vocal-tract-length>` |
| Whispering | `<amazon:effect name="whispered">` |
Unsupported SSML tags in input text generate errors.
Timbre describes the perceived color or quality of a sound, independently from pitch or loudness. Timbre is what differentiates one voice from another, even when their pitch and loudness are the same.
Trained impersonators learn to control the movements of their vocal apparatus to such a degree that they are even able to alter their voices to make themselves sound like somebody else.
An important physiological feature that contributes towards speech timbre is the vocal tract, which is a cavity of air that spans from the top of the vocal folds up to the edge of the lips. There are a variety of muscles that make it possible to change the shape of the vocal tract cavity by making it longer, shorter, wider, or narrower. The effect of these changes causes the resulting speech sounds to be amplified or filtered out.
Pitch is the perception of a voice as sounding higher or lower. Women generally have shorter vocal folds that vibrate more frequently (~180-200 cycles per second); men have, on average, longer vocal folds that vibrate more slowly (~110 cycles per second). Similarly, the average vocal tract length is shorter for women than it is for men (~14cm vs ~17cm).
There is a natural correlation between vocal fold length and vocal tract length, such that when one increases the other tends to increase as well. The Timbre feature allows developers to change the size of the vocal tract while retaining the ability to control pitch.
Vocal tract and speech synthesis
When you increase the vocal-tract-length, the speaker will sound like they’re bigger. When you decrease it, they will sound smaller.
Here’s how you can modify the length of the speaker’s vocal tract:

- `+n%` or `-n%`: Adjusts the vocal tract length by a relative percentage change to the current voice. For example, +4% or -2%.
- `n%`: Adjusts the vocal tract length to an absolute percentage value of the current voice. For example, 104% or 98%.

Vocal tract length can be increased up to +100%, and decreased down to -50%.

To reset the vocal tract length to the default value for the current voice, use `vocal-tract-length="100%"` (the absolute value of the unmodified voice).
The following example shows how the vocal tract length can be modified, using Joanna’s voice:
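A sketch of what such an example might look like (the sentence text and percentages here are illustrative, not from the original):

```xml
<speak>
  This is my regular voice.
  <amazon:effect vocal-tract-length="+15%">Now I sound bigger,</amazon:effect>
  <amazon:effect vocal-tract-length="-15%">and now I sound smaller.</amazon:effect>
  <amazon:effect vocal-tract-length="100%">Setting the length back to 100% resets me to the default.</amazon:effect>
</speak>
```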
Combining multiple tags
You can combine the vocal-tract-length SSML tag with any other SSML tag that is supported by Amazon Polly. Since vocal tract length and pitch are closely connected in nature, you might get the best results by changing the vocal tract length together with the pitch (by applying the `<prosody>` tag).
The pitch and timbre of a person’s voice are connected in human speech. If you reduce the vocal tract length, you might consider increasing the pitch as well; if instead you lengthen the vocal tract, you might also want to lower the pitch.
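A minimal sketch of pairing the two tags, with illustrative percentages:

```xml
<speak>
  <amazon:effect vocal-tract-length="-10%">
    <prosody pitch="+10%">A shorter vocal tract paired with a higher pitch,</prosody>
  </amazon:effect>
  <amazon:effect vocal-tract-length="+10%">
    <prosody pitch="-10%">and a longer vocal tract paired with a lower pitch.</prosody>
  </amazon:effect>
</speak>
```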
Samples range from very lifelike speech to more character-like speech.
Vocal-Tract-Length and Pitch Sample Matrix (.ppt)
Modify the Timbre of Amazon Polly Voices with the New Vocal Tract SSML Feature
To make Alexa pause while she talks, you can add an SSML tag into the middle of your text. The SSML format for breaks is just one tag, and follows this format:
The amount of time can either be in seconds (s), or milliseconds (ms). Remember to add the forward slash after the pause length, otherwise your tag won’t work!
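For example, a three-second pause (the surrounding sentence is illustrative):

```xml
<speak>
  There is a three second pause here <break time="3s"/> then the speech continues.
</speak>
```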
Some skills read plenty of text, and begin to sound mechanical if there are no natural pauses. Using breaks between paragraphs can solve that, and the <p></p> markup provides a brief pause that can easily be controlled when scripting the request. There are other times where a longer break is necessary that can be solved by setting a timed break (code below). This can be for up to ten seconds.
Polly puts a ~0.4-second pause between sentences and a ~0.7-second pause between paragraphs; the same pauses can also be produced explicitly in code.
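A sketch combining paragraph breaks with an explicit timed break (the text is illustrative):

```xml
<speak>
  <p>This paragraph is followed by a pause of roughly 0.7 seconds.</p>
  <p>For a longer silence, force it explicitly. <break time="10s"/> Ten seconds later, speech resumes.</p>
</speak>
```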
The `<emphasis>` tag’s level attribute accepts the following values:
strong: Increase the volume and slow down the speaking rate so the speech is louder and slower.
moderate: Increase the volume and slow down the speaking rate, but not as much as when set to strong. This is used as a default if level is not provided.
reduced: Decrease the volume and speed up the speaking rate. The speech is softer and faster.
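The levels above belong to the `<emphasis>` tag (per the standard Alexa SSML reference); a minimal sketch of its use:

```xml
<speak>
  I already told you I <emphasis level="strong">really like</emphasis> that person.
</speak>
```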
Represents a paragraph. This tag provides extra-strong breaks before and after the tag. This is equivalent to specifying a pause with break strength="x-strong".
Provides a phonemic/phonetic pronunciation for the contained text. For example, people may pronounce words like “pecan” differently.
When using this tag, Alexa uses the pronunciation provided in the ph attribute rather than the text contained within the tag. However, you should still provide human-readable text within the tags. In the following example, the word “pecan” shown within the tags is never spoken. Instead, Alexa speaks the text provided in the ph attribute:
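A sketch of the pecan example (the IPA values shown are our reconstruction of the two common pronunciations, not the original markup):

```xml
<speak>
  You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
  I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
```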
Modifies the volume, pitch, and rate of the tagged speech.
Modify the rate of the speech:
x-slow, slow, medium, fast, x-fast: Set the rate to a predefined value.
n%: specify a percentage to increase or decrease the speed of the speech:
100% indicates no change from the normal rate.
Percentages greater than 100% increase the rate.
Percentages below 100% decrease the rate.
The minimum value you can provide is 20%.
Raise or lower the tone (pitch) of the speech:
x-low, low, medium, high, x-high: Set the pitch to a predefined value.
+n%: Increase the pitch by the specified percentage. For example: +10%, +5%. The maximum value allowed is +50%. A value higher than +50% is rendered as +50%.
-n%: Decrease the pitch by the specified percentage. For example: -10%, -20%. The smallest value allowed is -33.3%. A value lower than -33.3% is rendered as -33.3%.
Change the volume for the speech:
silent, x-soft, soft, medium, loud, x-loud: Set volume to a predefined value for current voice.
+ndB: Increase volume relative to the current volume level. For example, +0dB means no change of volume. +6dB is approximately twice the current amplitude. The maximum positive value is about +4.08dB.
-ndB: Decrease the volume relative to the current volume level. For example, -6dB means approximately half the current amplitude.
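A minimal sketch combining the three prosody attributes in one tag (values are illustrative):

```xml
<speak>
  <prosody rate="slow" pitch="-10%" volume="loud">
    This sentence is slower, lower, and louder than the default voice.
  </prosody>
  Normal speech resumes here.
</speak>
```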
Represents a sentence. This tag provides strong breaks before and after the tag.
This is equivalent to:
Ending a sentence with a period (.).
Specifying a pause with break strength="strong".
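A sketch showing the two equivalent ways to end a sentence:

```xml
<speak>
  <s>This sentence is wrapped in an s tag</s>
  This sentence simply ends with a period.
</speak>
```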
Describes how the text should be interpreted. This lets you provide additional context to the text and eliminate any ambiguity on how Alexa should render the text. Indicate how Alexa should interpret the text with the interpret-as attribute.
characters, spell-out: Spell out each letter.
cardinal, number: Interpret the value as a cardinal number.
ordinal: Interpret the value as an ordinal number.
digits: Spell each digit separately.
fraction: Interpret the value as a fraction. This works for both common fractions (such as 3/20) and mixed fractions (such as 1+1/2).
unit: Interpret a value as a measurement. The value should be either a number or fraction followed by a unit (with no space in between) or just a unit.
date: Interpret the value as a date. Specify the format with the format attribute.
time: Interpret a value such as 1'21" as duration in minutes and seconds.
telephone: Interpret a value as a 7-digit or 10-digit telephone number. This can also handle extensions (for example, 2025551212x345).
address: Interpret a value as part of street address.
interjection: Interpret the value as an interjection. Alexa speaks the text in a more expressive voice. For optimal results, only use the supported interjections and surround each speechcon with a pause. For example: <say-as interpret-as="interjection">Wow.</say-as>. Speechcons are supported for the languages listed below.
expletive: “Bleep” out the content inside the tag.
Only used when interpret-as is set to date. Set to one of the following to indicate format of the date:
Alternatively, if you provide the date in YYYYMMDD format, the format attribute is ignored. You can include question marks (?) for portions of the date to leave out. For instance, Alexa would speak <say-as interpret-as="date">????0922</say-as> as “September 22nd”.
Note that the Alexa service attempts to interpret the provided text correctly based on the text’s formatting even without this tag. For example, if your output speech includes “202-555-1212”, Alexa speaks each individual digit, with a brief pause for each dash. You don’t need to use say-as interpret-as="telephone" in this case. However, if you provided the text “2025551212”, but you wanted Alexa to speak it as a phone number, you would need to use say-as interpret-as="telephone".
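A few illustrative interpret-as values in use (sentence text is our own):

```xml
<speak>
  Call <say-as interpret-as="telephone">2025551212</say-as> today.
  Your code is <say-as interpret-as="digits">1234</say-as>.
  The meeting is on <say-as interpret-as="date" format="md">9/22</say-as>.
</speak>
```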
To include a speechcon in your skill’s text-to-speech response, use the <say-as interpret-as="interjection"> SSML tag:
Be sure to surround each speechcon with a pause. You can use punctuation (such as a period or comma) or other SSML tags (for instance, <break> or <s>) for pauses.
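For example, surrounding a speechcon with punctuation-based pauses (sentence text is illustrative):

```xml
<speak>
  Here is your result. <say-as interpret-as="interjection">ta da!</say-as> Well done.
</speak>
```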
Speechcon Names – 183
- all righty
- as you wish
- au revoir
- aw man
- bada bing bada boom
- bah humbug
- batter up
- beep beep
- bon appetit
- bon voyage
- boo hoo
- cha ching
- cheer up
- choo choo
- click clack
- cock a doodle doo
- ding dong
- dot dot dot
- dun dun dun
- en garde
- fancy that
- giddy up
- good grief
- good luck
- good riddance
- great scott
- heads up
- hear hear
- hip hip hooray
- jeepers creepers
- jiminy cricket
- just kidding
- knock knock
- le sigh
- look out
- mamma mia
- man overboard
- mazel tov
- nanu nanu
- neener neener
- no way
- now now
- oh boy
- oh brother
- oh dear
- oh my
- oh snap
- okey dokey
- ooh la la
- open sesame
- read ’em and weep
- ruh roh
- spoiler alert
- ta da
- ta ta
- tee hee
- there there
- tick tick tick
- tsk tsk
- uh huh
- uh oh
- wah wah
- watch out
- way to go
- well done
- well well
- whoops a daisy
- woo hoo
- yadda yadda yadda
- yoo hoo
- you bet
Polly supports SSML tags with two extensions: breaths and voice effects.
The self-closing `amazon:breath` tag instructs the artificial speaker to take a (fairly lifelike) breath of a specified length and volume.
Voice effects include whispering, speaking softly and changing the vocal tract length to make the speaker sound bigger or smaller.
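A sketch of the breath and voice effects together (text is illustrative):

```xml
<speak>
  <amazon:effect name="whispered">This part is whispered.</amazon:effect>
  <amazon:breath duration="long" volume="loud"/>
  <amazon:effect phonation="soft">And this part is spoken softly.</amazon:effect>
</speak>
```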
SSML also specifies a fair amount of markup for prosody, including markup for pitch range.
It will no longer be good enough to get something published. Differentiation will come from standout quality.
SSML enables complete control over audio rendering. If you’re familiar with CSS for browser apps, SSML is the equivalent for audio apps. Imagine how poor the user experience of our favorite websites would be if they rendered only basic text. The same is true for voice, which is why learning SSML is so important for furthering the audio experience.
Prosody comprises the linguistic elements of speech that are not vowels and consonants: properties of syllables and larger units of speech, including intonation, tone, stress, and rhythm.
Prosody reflects various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus. It may otherwise reflect other elements of language that may not be encoded by grammar or by choice of vocabulary.
The study of prosodic speech aspects distinguishes between auditory measures (subjective impressions produced in the mind of the listener) and acoustic measures (physical properties of the sound wave that may be measured objectively). Auditory and acoustic measures of prosody do not correspond in a linear way.
In auditory terms, the major variables are:
- the pitch of the voice (varying between low and high)
- length of sounds (varying between short and long)
- loudness, or prominence (varying between soft and loud)
- timbre (quality of sound)
In acoustic terms, these correspond reasonably closely to:
- fundamental frequency (measured in hertz, or cycles per second)
- duration (measured in time units such as milliseconds or seconds)
- intensity, or sound pressure level (measured in decibels)
- spectral characteristics (distribution of energy at different parts of the audible frequency range)
Different combinations of these variables are exploited in intonation and stress, as well as rhythm, tempo and loudness. Additional prosodic variables include voice quality and pausing.
Prosodic features are said to be suprasegmental, since they are properties of units of speech larger than the individual segment (though exceptionally it may happen that a single segment may constitute a syllable, and thus even a whole utterance, e.g. “Ah!”). It is necessary to distinguish between the personal, background characteristics that belong to an individual’s voice (for example their habitual pitch range) and the independently variable prosodic features that are used contrastively to communicate meaning (for example, the use of changes in pitch to indicate the difference between statements and questions).
English intonation is based on three aspects:
- The division of speech into units
- The highlighting of particular words and syllables
- The choice of pitch movement (e.g. fall or rise)
These are sometimes known as Tonality, Tonicity and Tone.
Speakers are capable of speaking sometimes with a wide pitch range (usually associated with excitement), at other times with a narrow range. English has been said to make use of changes in key: shifting one’s intonation into the higher or lower part of one’s pitch range is believed to be meaningful in certain contexts.
Stress makes a syllable prominent. Stress may be studied in relation to individual words (named “word stress” or lexical stress) or in relation to larger units of speech (traditionally referred to as “sentence stress” but more appropriately named “prosodic stress”). Stressed syllables are made prominent by several variables, by themselves or in combination.
Stress is associated with the following:
- pitch prominence, that is, a pitch level that is different from that of neighbouring syllables, or a pitch movement.
- increased length (duration).
- increased loudness (dynamics).
- differences in timbre: in English and some other languages, stress is associated with aspects of vowel quality (whose acoustic correlate is the formant frequencies or spectrum of the vowel). Unstressed vowels tend to be centralized relative to stressed vowels, which are normally more peripheral in quality.
These cues to stress are not equally powerful: pitch, length, and loudness form a scale of importance in bringing syllables into prominence, with pitch the most efficacious and loudness the least.
When pitch prominence is the major factor, the resulting prominence is often called accent rather than stress.
There is considerable variation from language to language concerning the role of stress in identifying words or in interpreting grammar and syntax.
Speech tempo is a measure of the number of speech units of a given type produced within a given amount of time. Speech tempo is believed to vary within the speech of one person according to contextual and emotional factors, between speakers and also between different languages and dialects. However, there are many problems involved in investigating this variance scientifically.
Measurements of speech tempo can be strongly affected by pauses and hesitations. For this reason, it is usual to distinguish between speech tempo including pauses and hesitations and speech tempo excluding them. The former is called speaking rate and the latter articulation rate.
One measure is sounds per second: rates vary from an average of 9.4 sounds per second for poetry reading to 13.83 per second for sports commentary.
Monosyllables may be pronounced as “clipped”, “drawled” or “held” and polysyllabic utterances may be spoken at “allegro”, “allegrissimo”, “lento” and “lentissimo”.
The widespread view that some languages are spoken more rapidly than others is an illusion. This illusion is related to differences of rhythm and pausing.
Although rhythm is not a prosodic variable in the way that pitch or loudness are, it is usual to treat a language’s characteristic rhythm as a part of its prosodic phonology. It has often been asserted that languages exhibit regularity in the timing of successive units of speech, a regularity referred to as isochrony, and that every language may be assigned one of three rhythmical types: stress-timed (where the durations of the intervals between stressed syllables is relatively constant), syllable-timed (where the durations of successive syllables are relatively constant) and mora-timed (where the durations of successive morae are relatively constant).
Voiced or unvoiced, the pause is a form of interruption to articulatory continuity. Conversation analysis commonly notes pause length. Distinguishing auditory hesitation from silent pauses is one challenge. Contrasting junctures within and without word chunks can aid in identifying pauses.
There are a variety of “filled” pause types. Formulaic language pause fillers include “Like”, “Er” and “Uhm”, and paralinguistic expressive respiratory pauses include the sigh and gasp.
Although related to breathing, pauses may contain contrastive linguistic content, as in the periods between individual words in English advertising voice-over copy sometimes placed to denote high information content, e.g. “Quality. Service. Value.”
Pausing or its lack contributes to the perception of word groups, or chunks. Chunks commonly highlight lexical items or fixed expression idioms. The well-known English chunk “Know what I mean?” sounds like a single word (“No-whuta-meen?”) due to blurring or rushing the articulation of adjacent word syllables, thereby changing the potential open junctures between words into closed junctures.
Intonation is said to have a number of perceptually significant functions in English and other languages, contributing to the recognition and comprehension of speech.
The sentence “They invited Bob and Bill and Al got rejected” is ambiguous when written, although addition of a written comma after either “Bob” or “Bill” will remove the sentence’s ambiguity. But when the sentence is read aloud, prosodic cues like pauses (dividing the sentence into chunks) and changes in intonation will reduce or remove the ambiguity. Moving the intonational boundary in cases such as the above example will tend to change the interpretation of the sentence.
Intonation and stress work together to highlight important words or syllables for contrast and focus. A well-known example is the ambiguous sentence “I never said she stole my money”, where there are seven meaning changes depending on which of the seven words is vocally highlighted.
Prosody plays a role in the regulation of conversational interaction and in signaling discourse structure indicating whether information is new or already established; whether a speaker is dominant or not in a conversation; and when a speaker is inviting the listener to make a contribution to the conversation.
Prosody is also important in signalling emotions and attitudes. When this is involuntary (as when the voice is affected by anxiety or fear), the prosodic information is not linguistically significant. However, when the speaker varies her speech intentionally, for example to indicate sarcasm, this usually involves the use of prosodic features. The most useful prosodic feature in detecting sarcasm is a reduction in the mean fundamental frequency relative to other speech for humor, neutrality, or sincerity. While prosodic cues are important in indicating sarcasm, context clues and shared knowledge are also important.
Native speakers listening to actors reading emotionally neutral text while projecting emotions correctly recognized happiness 62% of the time, anger 95%, surprise 91%, sadness 81%, and neutral tone 76%. When a database of this speech was processed by computer, segmental features allowed better than 90% recognition of happiness and anger, while suprasegmental prosodic features allowed only 44%–49% recognition. The reverse was true for surprise, which was recognized only 69% of the time by segmental features and 96% of the time by suprasegmental prosody.
In typical conversation (no actor voice involved), the recognition of emotion may be quite low, of the order of 50%, hampering the complex interrelationship function of speech advocated by some authors. However, even if emotional expression through prosody cannot always be consciously recognized, tone of voice may continue to have subconscious effects in conversation. This sort of expression stems not from linguistic or semantic effects, and can thus be isolated from traditional linguistic content. The average person’s ability to decode the conversational implicature of emotional prosody has been found to be slightly less accurate than traditional facial expression discrimination ability; however, specific ability to decode varies by emotion. These emotional cues have been determined to be ubiquitous across cultures, as they are utilized and understood across cultures. Various emotions, and their general experimental identification rates, are as follows:
- Anger and sadness: High rate of accurate identification
- Fear and happiness: Medium rate of accurate identification
- Disgust: Poor rate of accurate identification
The prosody of an utterance is used by listeners to guide decisions about the emotional affect of the situation. Whether a person decodes the prosody as positive, negative, or neutral plays a role in the way a person decodes a facial expression accompanying an utterance. As the facial expression becomes closer to neutral, the prosodic interpretation influences the interpretation of the facial expression.
Unique prosodic features have been noted in infant-directed speech (IDS) – also known as baby talk, child-directed speech (CDS), or motherese. Adults, especially caregivers, speaking to young children tend to imitate childlike speech by using higher and more variable pitch, as well as an exaggerated stress. These prosodic characteristics are thought to assist children in acquiring phonemes, segmenting words, and recognizing phrasal boundaries. And though there is no evidence to indicate that infant-directed speech is necessary for language acquisition, these specific prosodic features have been observed in many different languages.
IBM Cloud
Express-as attributes: GoodNews, Apology, or Uncertainty.
By default, the IBM Text to Speech service synthesizes text in a neutral declarative style. The service extends SSML with an <express-as> element that produces expressiveness by converting text to synthesized speech in various speaking styles. The element is analogous to the SSML element <say-as>, which specifies text normalization for formatted text such as dates, times, and numbers.
GoodNews expresses a positive, upbeat message.
Apology expresses a message of regret.
Uncertainty conveys an uncertain, interrogative message.
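A sketch of the three styles in one request (assuming the express-as element’s type attribute; the sentence text is illustrative):

```xml
<speak>
  <express-as type="GoodNews">Your application has been approved!</express-as>
  <express-as type="Apology">I am sorry the review took longer than expected.</express-as>
  <express-as type="Uncertainty">I am not sure how long the next step will take.</express-as>
</speak>
```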