READ this article about how to synthesize or mimic actual human voices: https://google.github.io/tacotron/publications/speaker_adaptation
We’re not using this technique yet. But we’re keeping an eye on it. Soon we can use Hollywood voice-over talent – for free.
Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. With SSML tags, you can customize and control aspects of speech such as pronunciation, volume, and speech rate.
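As a rough illustration of what such markup looks like, the sketch below assembles a small SSML document in Python. The helper name `speak` and the sample sentence are hypothetical; the `rate` and `volume` values are standard `<prosody>` keywords.

```python
from xml.sax.saxutils import escape


def speak(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap plain text in <speak> with a <prosody> tag for rate and volume.

    Hypothetical helper for illustration; it only builds a string.
    """
    body = escape(text)  # "&", "<" and ">" must be escaped inside XML
    return f'<speak><prosody rate="{rate}" volume="{volume}">{body}</prosody></speak>'


ssml = speak("Tom & Jerry are back.", rate="slow", volume="loud")
print(ssml)
```

The resulting string is what would be passed to a speech service in place of plain text.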
Voices in Amazon Polly
Naja – Female
Lotte – Female
English (Australian) (en-AU)
English (British) (en-GB)
English (Indian) (en-IN)
English (US) (en-US)
English (Welsh) (en-GB-WLS)
French (Canadian) (fr-CA)
Portuguese (Brazilian) (pt-BR)
Portuguese (European) (pt-PT)
Spanish (Castilian) (es-ES)
Spanish (Latin American) (es-US)
“1” Nicole – English (Australia) – young woman
“2” Russell – English (Australia) – tough sounding guy.
“16” Raveena – English (India) – a little more mature woman – intelligent sounding.
“34” Amy – English (UK) – young woman throaty? proper cut-glass sounding
“35” Brian – English (UK) – slow deliberate articulate
“36” Emma – English (UK) – higher timbre, younger girlish? sounding English
“37” Joanna – English (US) – young woman, no accent, confident voice
“38” Joey – English (US) – slower, deliberate, young man; 1.25 speed is better
“39” Salli – English (US) – slow, lower voice tone, young woman
“40” Justin – English (US) – young teen girl, high timbre, early teen sounding
“41” Kendra – English (US) – middle-age woman, lower voice, authoritative
“42” Kimberly – English (US) – middle age, medium timbre
“43” Ivy – English (US) – very high timbre, almost fairy-like girlish
https://www.youtube.com/watch?v=3UgpfSp2t6k (Amy Walker)
https://www.youtube.com/watch?v=4NriDTxseog (Amy Walker)
if your accent broadens, you start to speak with a more noticeable accent.
cut-glass UK. used about a way of speaking in which words are pronounced very clearly and carefully, in a way that is typical of someone from a high social class: a cut-glass accent.
drawl: a slow, lazy way of speaking or an accent with unusually prolonged vowel sounds.
“a Texas drawl”
Estuary English is an English dialect or accent associated with South East England, especially the area along the River Thames and its estuary, centering around London. It has some of the phonetic features of working-class London speech, spreading at various rates socially into middle-class speech and geographically into other accents of southeastern England.
types: Cockney, Essex, Berkshire
The United States does not have a concrete ‘standard’ accent in the same way that Britain has Received Pronunciation. Nonetheless, a form of speech known to linguists as General American is perceived by many Americans to be “accent-less”, meaning a person who speaks in such a manner does not appear to be from anywhere. The region of the United States that most resembles this is the central Midwest, specifically eastern Nebraska (including Omaha and Lincoln), southern and central Iowa (including Des Moines), parts of Missouri, Ohio and western Illinois (including Peoria and the Quad Cities, but not the Chicago area).
AKA rhotic: relating to or denoting a dialect or variety of English, e.g., Midwestern American English, in which r is pronounced before a consonant (as in hard) and at the ends of words (as in far).
Warm, silvery baritone timbre in native British ‘Received Pronunciation’ English similar to actors Anthony Hopkins and Richard Burton, particularly suited to documentary and narration. Concise, crisp, clear…
British English, RP, warm, friendly, BBC style, with no discernible accent.
Cultural and ethnic American English
African American Vernacular English (Ebonics)
Cajun Vernacular English
Latino Vernacular Englishes
Pennsylvania Dutch English
General American English
General American: the “standard” or “mainstream” spectrum of American English.
Regional and local American English
Eastern New England
Boston and Maine: Greater Boston, including most of eastern Massachusetts
Mid-Atlantic (Delaware Valley)
North Midland: Omaha, Lincoln, Columbia, Springfield, Muncie, Columbus, etc.
South Midland: Oklahoma City, Tulsa, Topeka, Wichita, Kansas City, St. Louis (in transition), Decatur, Indianapolis, Cincinnati, Dayton, etc.
Southern Appalachian: Linden, Birmingham, Chattanooga, Knoxville, Asheville, and Greenville
Texas Southern: Lubbock, Odessa, and Dallas
New York City
Inland Northern: Chicago, Detroit, Milwaukee, Western New York, the Lower Peninsula of Michigan, and most of the U.S. Great Lakes region
Western New England: Connecticut, Hudson Valley, western Massachusetts, and Vermont
North Central (Upper Midwestern): Brockway, Minot, Bismarck, Bemidji, Chisholm, Duluth, Marquette, etc.
Upper Peninsula of Michigan (“Yooper”)
Western Pennsylvania (Pittsburgh)
to talk posh:
means that she speaks with an accent which is of a higher class to her natural accent. It could also mean that she phrases things in a different way to how she would normally say things, for example by using impressive or long words and avoiding slang.
classified as a soprano, mezzo-soprano, or contralto (alto) if you are a woman, and a countertenor, tenor, baritone, or bass if you are a man.
voice for women is a better indicator of age than voice type
weight – light voices, bright and agile; heavy voices, powerful, rich, and darker
timbre or color – unique voice quality and texture
age and experience
Young, light, bright
High, bright, flexible
High, dark, flexible
Warm, legato, full
Bright, metallic, theatrical
Powerful, young, full
Powerful, dark, rich
Agile, rich, bright
Strong, flexible, lachrymose
Rich, powerful, imposing
Low, full, warm
Full, low, stamina
Smooth, flexible, sweet
Brilliant, warm, agile
Mature, rich, powerful
AudioBooks and Podcasts
Digital Voice Actors
Mobile Brand Voice Actors
Articulate Artificial Voice Talent using Automatic Digital Production: unseen narrator talent for podcasts.
Passionate – expressive, enthusiastic, heartfelt, action-oriented, attractive
Quirky – irreverent, unexpected, contrarian, aggressive
Authentic – genuine, trustworthy, engaging, direct, believable, calming, authoritative, announcer, intellectual, serious
Not commercial but narration voices.
The podcast is a relatively new form of media that has truly revolutionized the world of the spoken word. Thanks to the development of the podcast, listeners can now access interesting and important information on a wide array of topics from wherever they happen to have smartphones or other mobile devices. The amazing vocal artists at Voices.com know how to take material on any subject and turn it into a fascinating podcast. It won’t be long before your listeners are clamoring for more!
Ready-to-use Voice Talent
No editing required.
Unlimited, no word-count limitations.
PODCASTS – OPENER AND CLOSER
2 x 15 seconds or less cost $100 free
Minimum audio book: $280 free
How many words does a voice talent read per minute? As a general guide; reading slow – 100, regular – 150, fast – 200.
Estimates – 300 words = 2 minutes | 900 words = 5 mins | 1,800 words = 10 mins | 2,750 = 15 mins | 3,500 = 20 mins | 5,500 = 30 mins | 7,500 = 45 mins | 9,500 = 60 mins
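The arithmetic behind these estimates can be sketched in a few lines of Python. This is a minimal sketch using the three reading rates from the general guide; the function name `estimate_minutes` is hypothetical, and its results will not match the rounded estimates above exactly, since those imply slightly different rates.

```python
# Words per minute, taken from the general guide in the text.
WORDS_PER_MINUTE = {"slow": 100, "regular": 150, "fast": 200}


def estimate_minutes(word_count: int, pace: str = "regular") -> float:
    """Return the estimated reading time in minutes for a given pace."""
    return word_count / WORDS_PER_MINUTE[pace]


for pace in WORDS_PER_MINUTE:
    print(pace, estimate_minutes(7500, pace))
```

For a 7,500-word story this gives a range from 37.5 minutes (fast) to 75 minutes (slow).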
10 Types of Voice Over Styles
Voice over artists can work in many different industries. This is what makes the work of voice over artists so exciting. In each industry, voice over artists must have a voice over style. The following are 10 types of voice over styles to be aware of when seeking work in this industry or looking for an artist for your project.
According to Edge Studio Voice Over Industry Center, about 8 percent of voice over artists work in television commercials. It’s a small part of the industry, but it can be the most lucrative. Breaking into this market means opening the possibility of getting more commercials, and soon your voice is well known among producers.
Most voice over artists work in narration; it makes up most of their income. In fact, 92 percent of voice over artists’ work comes from narration. Usually, this work consists of making audiobooks.
Voice over artists work as announcers. They can work for airports, stadiums, train stations, malls, and other large public places.
Similar to the narration field, voice over artists work in biography by telling the life stories of celebrities and politicians. They can work in biography films by telling the stories of the people in them.
Character and animation work is fun for many voice over artists. They must speak like a child to make toys talk or cartoons come alive. Character/Animation voice over artists often work as narrators for children’s books, videos and video games.
Voice over artists work in the corporate world. They are the voices in training videos, promotional material and human resources videos. They also work for trade shows as announcers.
Business owners are now seeking voice over artists to explain what they are selling and how to use their website. Voice over artists create videos for website tours, banner ads, and online tutorials.
A voice over artist with a motivating voice can be perfect for an exercise or inspiration video. This voice is calm, soothing, and powerful in a peaceful way.
Like voice over artists who work in the corporate world, those working in education are in training and educational videos. They can work for schools, universities, and other educational institutions.
Some films are not biographies, but they still want a voice over to explain what is going on in the scenes. Voice over artists can have a large or small part in these films.
These are just 10 of the most common voice over styles available. As you probably noticed, some of them overlap. As you’re looking to choose a style of your own, consider these options. If you’re seeking a voice over artist for your project, now you know which style to request.
1. Instructor (formal, didactic voice over)
2. Real Person (informal voice over)
3. Spokesperson (advocate, authoritative voice over)
4. Narrator (omniscient storyteller)
5. Announcer (sets the stage and calls for action)
Let’s explore these types of character roles in detail.
When teaching someone what to do, for example in a corporate training video or children’s game, the voice over best suited to the project is a straightforward, didactic and educated voice. The role of this particular voice talent is to instruct or provide information to fulfill a specific goal or purpose.
2. Real Person
Projects requiring a more casual approach often benefit from relatable, genuine voice overs. These voice overs are referred to as “Real Person” voice overs, commonly known as the “regular guy” or the “girl next door”. The character is homegrown, sensible, and friendly with a touch of familiarity and provides a more intimate interpretation that instills trust.
A Spokesperson can be on camera or off camera depending on the medium you are using. The role of a spokesperson is generally played by a confident, charismatic person able to promote a cause, product, or service with ease and authority. A voice over of this nature needs to be driven, optimistic and assured.
Storytelling is where the Narrator is most at home. Omniscient, courteous and honest, a Narrator’s job is to provide an audio landscape for a listener, briefing them on background information, posing questions, and providing solutions as they guide their audience through a program or documentary. Narrators can be male or female, and the most important factors are that they can communicate clearly and engagingly.
The Announcer, often heard live at events, on commercials, promos or introducing segments for podcasts, is a product of the broadcast age, most celebrated at its height in the Golden Age of Radio and early television broadcasts. Announcers can introduce an idea and assertively make a call for action at the conclusion of a commercial advertisement or short video. One common misconception is that an announcer has to sound like an announcer from decades ago; however, modern announcers act more like Narrators and, in many cases, adopt the Real Person approach.
Here are some of the best voice over markets to find work:
Narration: Narration makes up a huge portion of voice over work. For many in the voice talent industry, narration voice overs comprise the majority of their work. There is a lot of work out there in the form of recording audiobooks.
Commercials: Voice over work for commercials is a smaller industry. Estimates show that only around 8 percent of voice over actors find work in commercials. But we’ve added it to the shortlist because of how lucrative the commercial industry can be.
Announcement: Ever wondered whose voice you’re hearing in airports, stadiums, malls, and other large public spaces? In many cases these recordings are the work of a voice over artist.
Corporate: The corporate world has plenty of need for voice over work. You have training videos, human resources material, or similar such material that warrant the need for voice over talent.
Education: Schools, universities and other educational institutions have a seemingly endless need for voice over work. In this realm you can expect your voice over work to cover training and educational videos.
If your narration demo is simply tagged “narration,” it won’t show up in as many search results as a demo that is tagged “corporate, warm, trustworthy.” If the system can’t tell what you offer, it won’t know what auditions to put in your inbox! Beef up your profile with specifics on your vocal age range, styles (perky mom, corporate trainer, casual best friend), and tones (friendly, sympathetic, wry, sincere, etc.). These same descriptors need to be on your website, and ideally will be reflected in the overall look of your logo and website design.
Tips and Tricks
A native voice actor is the unquestionable choice for making a trustworthy explainer video. Make your video sound native by speaking in the regional language if necessary.
Using silence in the right places.
Attached is a screengrab of a page under test: http://tearelabs.com/last-rocket/ This is a page with one graphic header and about 7,500 words of content. It is a short story. I’m using voice #2, Russell, which has an Aussie accent.
What it tells us is that when a player (Speechkit.io) is installed, performance (speed) is borderline mediocre for mobile. Google estimates there will be a 20 percent visitor loss due to loading time. More on that in a minute.
I’ve thought about your question as to how much a service like this is worth or how much I would pay for it. Here are my observations based on experience and intuition.
Market size: The WordPress world is big. But not as big as Automattic claims. There are bloggers who debate the accuracy of the reported number of sites actually using WordPress. Half the users are North American. The rest tend toward usage by English speaking countries. Automattic claims over 30 percent of the web is using their CMS. This statistic is regurgitated over and over by unknowing and apathetic reporting.
Automattic’s estimates are inflated by sampling the better “creme” from their own trough. More realistic estimates from other sources say perhaps best-case 15 to 19 percent of the web is using WordPress. 77 million blogs are hosted on WordPress.com. It’s estimated that half are abandoned and still floating as Internet zombies. This artificially inflates user numbers.
But the majority of users appear to be bloggers who are self-publishing their ideas, philosophies, stories, and information.
75 percent of the domains registered in the US are held for speculative reselling. They sit unused, waiting for purchasers (suckers) to need or want them. This also inflates or skews statistical reporting of the size of the Internet.
My experience is that 99 percent of WordPress users expect all things WordPress to be free. Free plugins. Free themes. Free CMS. Even free images and graphics.
How to monetize your plugin? Can a company make money using a freemium plugin? Unknown. Three things are necessary:
1. The plugin must solve a niche problem or need.
2. The niche must have a community (AKA The List). This is the method of communicating with them.
3. The offer must be good. That includes terms of sale, specifications, duration of service, price, and service (and probably more).
So, three things: need, list, offer.
You’ll notice that list didn’t include branding or logos or any other expressive decoration. While I am a designer at heart, you can get those things “out of a can” or “off the shelf” nowadays, and they are sufficient for starting or testing. Don’t waste money on this stuff. Save your resources for other more valuable things.
My gut says your plugin should be classified as competition to voiceover talent, especially those focused on podcasting (online platforms). The growth rate of this industry is 35 percent by the end of 2024.
You are pricing against live voice over talent. The BIG need you fill is optimization of the digital mobile user experience. The plugin would be described as a marketing automation product.
Voice over talent (or actors) is used for various mediums and deliverables but those I feel your plugin is best suited for include: E-learning, Podcasts and Audiobooks: Mixed Messages. Podcasts and audiobooks continue to be mediums that stir interest, but seem to lose participants just as fast as they gain them.
On the US voice over market, I quote:
Gen X and Millennials are the Top Focus.
Buying power: Overwhelmingly, Gen X (35- 64 yrs old) and Millennials (18 – 34 yrs old) fall within the bullseye that media is targeting. Interestingly, Gen X (54%) is slightly ahead of Millennials (40%).
Meanwhile, Gen Z is still emerging at 3%, while Baby Boomers and the Great Generation are balancing out the scales at the other end of the spectrum, with 2% and 1% respectively.
Buyers: The vast majority of the respondents opted for a voice actor who could sound like a peer (73% want a ‘same age’ voice).
Increasingly, this peer must clearly represent diversity – so their voice, and the brand they’re voicing by extension, can be perceived as relevant, no matter where in the world the message is sent.
The Conversational Persona is Dominating
Personas that are approachable and conversational in nature have risen significantly in popularity. The ‘executive’ persona, which is professional and friendly, has risen in popularity by 102%, and the Girl Next Door became 48% more popular in 2016 compared to 2015.
The conversational ‘type’ of persona includes those such as ‘everyman, mother, narrator and storyteller.’
The announcer persona is dwarfed in comparison.
The Most Desirable Vocal Qualities
A persona that held authority, while remaining approachable, friendly, warm, and informative.
In essence, this persona embodied the qualities that most characterize as ‘conversational’: a voice that speaks to the listener as a peer, albeit a peer with the clout of an expert.
The new type of voice that everyone will be hearing in 2017 is one that isn’t perfect; it’s slightly flawed and more real, like that of your best friend. It’s the guy or girl next door. It’s genuine, and sounds nothing like a celebrity endorsement.
At the same time, 14% share that they are increasingly needing a more neutral sound, perhaps so as not to alienate anyone in a larger customer base. 40% report no change at all (that is, they’re still needing neutral English). The rest simply don’t worry about this.
Bloggers want to improve mobile user experience.
While Amazon Alexa (26 percent of voice-first users) has done well for “reading” news and podcast applications, are they the one to compete against? Essentially you are turning any browser into an Alexa competitor.
Many of these applications allow users to download podcasts or stream them on demand as an alternative to downloading. Many podcast players (apps as well as dedicated devices) allow listeners to skip around the podcast and control the playback speed.
Podcasts are usually free of charge to listeners and can often be created for little to no cost, which sets them apart from the traditional model of “gate-kept” media and production tools.
Podcast creators can monetize their podcasts by allowing companies to purchase ad time, as well as via sites such as Patreon, which provides special extras and content to listeners for a fee. It is very much a horizontal media form: producers are consumers, consumers may become producers, and both can engage in conversations with each other.
Podcasting, once an obscure method of spreading information, has become a recognized medium for distributing audio content, whether for corporate or personal use. Podcasts are similar to radio programs, but they are audio files. Listeners can play them at their convenience, using devices that have become more common than portable broadcast receivers.
On November 16, 2006, the Apple Trademark Department stated that Apple does not object to third-party usage of “the generic term” “podcast” to refer to podcasting services.
A podcast novel (also known as a serialized audiobook or podcast audiobook) is a literary format that combines the concepts of a podcast and an audiobook. Like a traditional novel, a podcast novel is a work of long literary fiction; however, this form of the novel is recorded into episodes that are delivered online over a period of time and in the end available as a complete work for download. The episodes may be delivered automatically via RSS, through a website, blog, or another syndication method. These files are either listened to directly on a user’s computer or loaded onto a portable media device to be listened to later.
The types of novels that are podcasted vary from new works from new authors that have never been printed, to well-established authors that have been around for years, to classic works of literature that have been in print for over a century. In the same style as an audiobook, podcast novels may be elaborately narrated with separate voice actors for each character and sound effects, similar to a radio play. Other podcast novels have a single narrator reading the text of the story with little or no sound effects.
Podcast novels are distributed over the Internet, commonly on a weblog. Podcast novels are released in episodes on a regular schedule (e.g., once a week) or irregularly as each episode is released when completed. They can either be downloaded manually from a website or blog or be delivered automatically via RSS or another method of syndication. Ultimately, a serialized podcast novel becomes a completed audiobook.
Some podcast novelists give away a free podcast version of their book as a form of promotion. Some such novelists have even secured publishing contracts to have their novels printed. Podcast novelists have commented that podcasting their novels lets them build audiences even if they cannot get a publisher to buy their books. These audiences then make it easier to secure a printing deal with a publisher at a later date. These podcast novelists also claim the exposure that releasing a free podcast gains them makes up for the fact that they are giving away their work for free.
The advent of user-generated content marked a shift among media organizations from creating online content to providing facilities for amateurs to publish their own content.
Mobile Knowledge Transfer: Podcasting is also used by corporations to disseminate information faster and more easily. It can be seen as a further development of Rapid E-Learning, as the content can be created fast and without much effort. Learners can learn in idle times, which saves time and money for them and their organizations. Audio podcasts can be used during other activities like driving a car or traveling by train or bus. A group often targeted is the salesforce, as they are highly mobile. There, podcasting can be used for sales enablement, with the goal of keeping sales employees aware of and knowledgeable about the company’s products, processes, initiatives, etc. An often-used format is expert interviews with statements from experienced role models, which also conveys informal/tacit knowledge.
Professional Development: Professional development podcasts exist for educators. Some podcasts may be general in nature or may be slightly more specific and focus on the use of interactive white boards in the classroom.
Fiction. Podcasts like Escape Pod are used to distribute short stories in audiobook format. Other podcasts distribute stories in the format of radio drama.
Podcasting as a Business Content Marketing Strategy. Podcasts create brand fanatics, people who are deeply invested in who podcasters are as people and as business professionals. This is the essence of long-form content marketing. Every minute that a customer or prospect listens to a podcaster speak with authority, the podcaster is establishing themselves as a thought leader. Conceptually, the more time an audience spends with the podcaster’s content, the more authority the podcaster will acquire.
Today there are more than 115,000 English-language podcasts available on the internet, and dozens of websites available for distribution at little or no cost to the producer or listener.
(AudioFeast shut down its service in July 2005 due to the unwillingness of its free customers to pay for its $49.95 paid annual subscription service, and a lack of a strong competitive differentiation in the market with the emergence of free RSS podcatchers.)
In March 2007, after being fired as on-air talent at KYSR (STAR) in Los Angeles, California, Jack and Stench started their own subscription-based podcast. At $5.00 per subscription, their loyal fans had access to a one-hour podcast, free of any commercials. They have held free local events at bars, ice cream parlors and restaurants all around Southern California. With a successful run of 5 years and over 1,200 podcasts (as of March 2012), The Jack and Stench Show is among the longest-running monetized podcasts.
The BBC noted in 2011 that more people (eight million in the UK or about 16% of the population, with half listening at least once a week – a similar proportion to the USA) had downloaded podcasts than used Twitter.
For example, podcasting has been picked up by some print media outlets, which supply their readers with spoken versions of their content.
Although firm business models have yet to be established, podcasting represents a chance to bring additional revenue to a newspaper through advertising, subscription fees and licensing.
Veteran podcaster Gary Leland joined forces with Dan Franks and Jared Easley to form a new international conference for podcasters in early 2014 called Podcast Movement. Unlike other new media events, Podcast Movement was the first conference of its size in over a decade that was focused specifically on podcasting, and has tracks for both new and experienced podcast creators, as well as industry professionals. The fourth annual conference is scheduled for August 2017 in Anaheim, California.
Show status in page or post directory of audio state.
No way of knowing what voice was used.
Refresh all audio button doesn’t go away. (unless you press save changes).
Does control panel remote volume affect playback on site?
Dropcap shortcodes error
Does the control panel editor work? Reload?
Far futures expiration setting?
iPad text covered on player
Amazon Polly provides a variety of different voices in multiple languages for use when synthesizing speech from text.
|Language||Female Names/ID||Male Names/ID|
|Chinese, Mandarin (cmn-CN)||Zhiyu|| |
|English, Australian (en-AU)||Nicole||Russell|
|English, British (en-GB)||Amy, Emma||Brian|
|English, Indian (en-IN)||Aditi (bilingual with Hindi), Raveena|| |
|English, US (en-US)||Ivy, Joanna, Kendra, Kimberly, Salli||Joey, Justin, Matthew|
|English, Welsh (en-GB-WLS)|| ||Geraint|
|French, Canadian (fr-CA)||Chantal|| |
|Hindi (hi-IN)||Aditi (bilingual with Indian English)|| |
|Portuguese, Brazilian (pt-BR)||Vitória/Vitoria||Ricardo|
|Portuguese, European (pt-PT)||Inês/Ines||Cristiano|
|Spanish, European (es-ES)||Conchita, Lucia||Enrique|
|Spanish, Mexican (es-MX)||Mia|| |
|Spanish, US (es-US)||Penélope/Penelope||Miguel|
To ensure continuous support for customers, we don’t plan to retire any voices. This applies to both currently available and future voices.
Supported SSML Tags
Amazon Polly supports the following SSML tags:
|Adding a Pause||<break>|
|Specifying Another Language for Specific Words||<lang>|
|Placing a Custom Tag in Your Text||<mark>|
|Adding a Pause Between Paragraphs||<p>|
|Using Phonetic Pronunciation||<phoneme>|
|Controlling Volume, Speaking Rate, and Pitch||<prosody>|
|Setting a Maximum Duration for Synthesized Speech||<prosody amazon:max-duration>|
|Adding a Pause Between Sentences||<s>|
|Controlling How Special Types of Words Are Spoken||<say-as>|
|Identifying SSML-Enhanced Text||<speak>|
|Pronouncing Acronyms and Abbreviations||<sub>|
|Improving Pronunciation by Specifying Parts of Speech||<w>|
|Adding the Sound of Breathing||<amazon:auto-breaths>|
|Adding Dynamic Range Compression||<amazon:effect name="drc">|
|Speaking Softly||<amazon:effect phonation="soft">|
|Controlling Timbre||<amazon:effect vocal-tract-length>|
|Whispering||<amazon:effect name="whispered">|
Unsupported SSML tags in input text generate errors.
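Because unsupported tags cause errors, it can help to check text before sending it. The following is a minimal Python sketch of such a pre-flight check: the tag set is copied from the table above, the helper name `unsupported_tags` is hypothetical, and the regular expression is a simplification that only looks at opening tag names. It is not a substitute for the service’s own validation.

```python
import re

# Tag names taken from the supported-tags table above.
SUPPORTED_TAGS = {
    "break", "lang", "mark", "p", "phoneme", "prosody", "s", "say-as",
    "speak", "sub", "w", "amazon:auto-breaths", "amazon:effect",
}


def unsupported_tags(ssml: str) -> list[str]:
    """Return the sorted names of opening tags that are not in the table."""
    # Matches "<name" but not "</name", so closing tags are ignored.
    names = re.findall(r"<\s*([A-Za-z][\w:-]*)", ssml)
    return sorted({name for name in names if name not in SUPPORTED_TAGS})


sample = '<speak>Hi <break time="1s"/> <audio src="clip.mp3"/> there</speak>'
print(unsupported_tags(sample))
```

An empty list means every tag in the text appears in the table.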
Timbre describes the perceived color or quality of a sound, independently from pitch or loudness. Timbre is what differentiates one voice from another, even when their pitch and loudness are the same.
Trained impersonators learn to control these movements to such a degree that they are even able to alter their voices to make themselves sound like somebody else.
An important physiological feature that contributes towards speech timbre is the vocal tract, which is a cavity of air that spans from the top of the vocal folds up to the edge of the lips. There are a variety of muscles that make it possible to change the shape of the vocal tract cavity by making it longer, shorter, wider, or narrower. The effect of these changes causes the resulting speech sounds to be amplified or filtered out.
Pitch sounds higher or lower. Women generally have shorter vocal folds that vibrate more frequently (~180-200 cycles per second). Men have, on average, longer vocal folds that vibrate more slowly (~110 cycles per second). Similarly, the average vocal tract length is shorter for women than it is for men (~14cm vs ~17cm).
There is a natural correlation between vocal fold length and vocal tract length, such that when one increases the other tends to increase as well. The Timbre feature allows developers to change the size of the vocal tract while retaining the ability to control pitch.
Vocal tract and speech synthesis
When you increase the vocal-tract-length, the speaker will sound like they’re bigger. When you decrease it, they will sound smaller.
Here’s how you can modify the length of the speaker’s vocal tract:
+n% or -n%: adjusts the vocal tract length by a relative percentage change in the current voice. For example, +4% or -2%.
n%: adjusts the vocal tract length to an absolute percentage value of the current voice. For example, 104% or 98%.
Vocal tract length can be increased up to +100%, and down to -50%.
To reset the vocal tract length to the default value for the current voice, use <amazon:effect vocal-tract-length="100%">.
The following example shows how the vocal tract length can be modified, using Joanna’s voice:
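As a minimal sketch, assuming Python is used to assemble the request text, a hypothetical helper `vocal_tract` could wrap a phrase in the tag and enforce the range given above. The sample sentences and percentages are illustrative only.

```python
def vocal_tract(text: str, change_percent: int) -> str:
    """Wrap text in <amazon:effect> with a relative vocal-tract-length change.

    Hypothetical helper; the -50% to +100% limits come from the text above.
    """
    if not -50 <= change_percent <= 100:
        raise ValueError("vocal tract change must be between -50% and +100%")
    # "+d" always prints the sign, giving "+15%" or "-15%".
    return (
        f'<amazon:effect vocal-tract-length="{change_percent:+d}%">'
        f"{text}</amazon:effect>"
    )


ssml = (
    "<speak>This is my original voice. "
    + vocal_tract("Now imagine that I am much bigger.", 15)
    + " "
    + vocal_tract("Or perhaps that I am very small.", -15)
    + "</speak>"
)
print(ssml)
```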
Combining multiple tags
You can combine the vocal-tract-length SSML tag with any other SSML tag that is supported by Amazon Polly. Since vocal tract length and pitch are closely connected in nature, you might get the best results by changing the vocal tract length together with the pitch (by applying the <prosody> tag).
The pitch and timbre of a person’s voice are connected in human speech. If you are going to reduce the vocal tract length, you might consider increasing the pitch as well. If instead you choose to lengthen the vocal tract, you might also want to lower the pitch.
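The pairing described above can be sketched in Python by nesting the two tags. The helper name `resize_voice` and the specific percentages are assumptions for illustration; suitable values depend on the voice.

```python
def resize_voice(text: str, tract_percent: int, pitch_percent: int) -> str:
    """Nest a vocal-tract-length effect inside a <prosody> pitch change."""
    if not -50 <= tract_percent <= 100:
        raise ValueError("vocal tract change must be between -50% and +100%")
    effect = (
        f'<amazon:effect vocal-tract-length="{tract_percent:+d}%">'
        f"{text}</amazon:effect>"
    )
    return f'<prosody pitch="{pitch_percent:+d}%">{effect}</prosody>'


# Shorter tract paired with higher pitch, and the reverse.
smaller = resize_voice("I sound smaller.", tract_percent=-10, pitch_percent=10)
bigger = resize_voice("I sound bigger.", tract_percent=10, pitch_percent=-10)
print("<speak>" + smaller + " " + bigger + "</speak>")
```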
Samples range from very lifelike speech, to more character-like speech.
Vocal-Tract-Length and Pitch Sample Matrix (.ppt)
Modify the Timbre of Amazon Polly Voices with the New Vocal Tract SSML Feature
To make Alexa pause while she talks, you can add an SSML tag into the middle of your text. The SSML format for breaks is just one tag, and follows this format: <break time="3s"/>
The amount of time can either be in seconds (s), or milliseconds (ms). Remember to add the forward slash after the pause length, otherwise your tag won’t work!
Some skills read a lot of text and begin to sound mechanical if there are no natural pauses. Using breaks between paragraphs can solve that, and the <p></p> markup provides a brief pause that can easily be controlled when scripting the request. At other times a longer break is needed, which can be handled by setting a timed break of up to ten seconds.
Polly puts a ~0.4 second pause between sentences and a ~0.7 second pause between paragraphs; both can also be set in code.
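A minimal Python sketch of both kinds of pause: a timed `<break>` and paragraph markup. The function names `pause` and `paragraphs` are hypothetical, and the ten-second cap is the limit mentioned above.

```python
from xml.sax.saxutils import escape


def pause(seconds: float) -> str:
    """Return a self-closing break tag, expressed in milliseconds."""
    if not 0 <= seconds <= 10:
        raise ValueError("a break must be between 0 and 10 seconds")
    milliseconds = round(seconds * 1000)
    # The trailing slash is required, otherwise the tag is not well formed.
    return f'<break time="{milliseconds}ms"/>'


def paragraphs(blocks: list[str]) -> str:
    """Wrap each block of text in <p></p> so the voice pauses between them."""
    return "".join(f"<p>{escape(block)}</p>" for block in blocks)


if __name__ == "__main__":
    body = paragraphs(["First paragraph.", "Second paragraph."])
    print(f"<speak>{body}{pause(3)}And now, after three seconds, the ending.</speak>")
```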
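The <emphasis> tag changes the rate and volume of the tagged speech according to its level attribute:
strong: Increase the volume and slow down the speaking rate so the speech is louder and slower.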
moderate: Increase the volume and slow down the speaking rate, but not as much as when set to strong. This is used as a default if level is not provided.
reduced: Decrease the volume and speed up the speaking rate. The speech is softer and faster.
The <p> tag represents a paragraph. This tag provides extra-strong breaks before and after the tag. This is equivalent to specifying a pause with <break strength="x-strong"/>.
The <phoneme> tag provides a phonemic/phonetic pronunciation for the contained text. For example, people may pronounce words like “pecan” differently.
When using this tag, Alexa uses the pronunciation provided in the ph attribute rather than the text contained within the tag. However, you should still provide human-readable text within the tags. In the following example, the word “pecan” shown within the tags is never spoken. Instead, Alexa speaks the text provided in the ph attribute:
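A short Python sketch of the “pecan” case. The helper name `phoneme` and the constant `PECAN_IPA` are hypothetical, and the IPA transcription is one common pronunciation chosen for illustration; the `alphabet` and `ph` attributes are the ones the tag uses.

```python
from xml.sax.saxutils import escape, quoteattr

# One possible IPA transcription of "pecan" (illustrative choice).
PECAN_IPA = "pɪˈkɑːn"


def phoneme(text: str, ipa: str) -> str:
    """Wrap human-readable text in a phoneme tag; only the ph value is spoken."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


if __name__ == "__main__":
    print(f"<speak>You say {phoneme('pecan', PECAN_IPA)}.</speak>")
```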
The <prosody> tag modifies the volume, pitch, and rate of the tagged speech.
Modify the rate of the speech:
x-slow, slow, medium, fast, x-fast: Set the rate to a predefined value.
n%: specify a percentage to increase or decrease the speed of the speech:
100% indicates no change from the normal rate.
Percentages greater than 100% increase the rate.
Percentages below 100% decrease the rate.
The minimum value you can provide is 20%.
Raise or lower the tone (pitch) of the speech:
x-low, low, medium, high, x-high: Set the pitch to a predefined value.
+n%: Increase the pitch by the specified percentage. For example: +10%, +5%. The maximum value allowed is +50%. A value higher than +50% is rendered as +50%.
-n%: Decrease the pitch by the specified percentage. For example: -10%, -20%. The smallest value allowed is -33.3%. A value lower than -33.3% is rendered as -33.3%.
Change the volume for the speech:
silent, x-soft, soft, medium, loud, x-loud: Set volume to a predefined value for current voice.
+ndB: Increase volume relative to the current volume level. For example, +0dB means no change of volume. +6dB is approximately twice the current amplitude. The maximum positive value is about +4.08dB.
-ndB: Decrease the volume relative to the current volume level. For example, -6dB means approximately half the current amplitude.
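The limits above can be mirrored on the client side. The following Python sketch clamps rate and pitch to the documented bounds and shows where the “+6dB is about twice the amplitude” figure comes from, using the standard relation ratio = 10^(dB/20). All function names are hypothetical, and volume is passed through unclamped.

```python
from xml.sax.saxutils import escape


def clamp_rate(percent: float) -> float:
    """The minimum rate is 20%; anything lower is raised to 20."""
    return max(20.0, percent)


def clamp_pitch(percent: float) -> float:
    """Pitch changes are limited to the range -33.3% .. +50%."""
    return max(-33.3, min(50.0, percent))


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel change into an amplitude ratio."""
    return 10 ** (db / 20)


def prosody(text: str, rate_percent: float = 100,
            pitch_percent: float = 0, volume_db: float = 0) -> str:
    """Build a prosody tag with rate, pitch and volume attributes."""
    return (
        f'<prosody rate="{clamp_rate(rate_percent):g}%" '
        f'pitch="{clamp_pitch(pitch_percent):+g}%" '
        f'volume="{volume_db:+g}dB">{escape(text)}</prosody>'
    )


if __name__ == "__main__":
    print(prosody("Slower, higher and quieter.", 80, 10, -6))
    print(round(db_to_amplitude_ratio(6), 3), round(db_to_amplitude_ratio(-6), 3))
```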
The <s> tag represents a sentence. This tag provides strong breaks before and after the tag.
This is equivalent to:
Ending a sentence with a period (.).
Specifying a pause with <break strength="strong"/>.
The <say-as> tag describes how the text should be interpreted. This lets you provide additional context to the text and eliminate any ambiguity on how Alexa should render the text. Indicate how Alexa should interpret the text with the interpret-as attribute.
characters, spell-out: Spell out each letter.
cardinal, number: Interpret the value as a cardinal number.
ordinal: Interpret the value as an ordinal number.
digits: Spell each digit separately.
fraction: Interpret the value as a fraction. This works for both common fractions (such as 3/20) and mixed fractions (such as 1+1/2).
unit: Interpret a value as a measurement. The value should be either a number or fraction followed by a unit (with no space in between) or just a unit.
date: Interpret the value as a date. Specify the format with the format attribute.
time: Interpret a value such as 1'21" as duration in minutes and seconds.
telephone: Interpret a value as a 7-digit or 10-digit telephone number. This can also handle extensions (for example, 2025551212x345).
address: Interpret a value as part of street address.
interjection: Interpret the value as an interjection. Alexa speaks the text in a more expressive voice. For optimal results, only use the supported interjections and surround each speechcon with a pause. For example: <say-as interpret-as="interjection">Wow.</say-as>. Speechcons are supported for the languages listed below.
expletive: “Bleep” out the content inside the tag.
The format attribute is only used when interpret-as is set to date. Set it to one of the following to indicate the format of the date: mdy, dmy, ymd, md, dm, ym, my, d, m, y.
Alternatively, if you provide the date in YYYYMMDD format, the format attribute is ignored. You can include question marks (?) for portions of the date to leave out. For instance, Alexa would speak <say-as interpret-as="date">????0922</say-as> as “September 22nd”.
Note that the Alexa service attempts to interpret the provided text correctly based on the text’s formatting even without this tag. For example, if your output speech includes “202-555-1212”, Alexa speaks each individual digit, with a brief pause for each dash. You don’t need to use say-as interpret-as="telephone" in this case. However, if you provided the text “2025551212”, but you wanted Alexa to speak it as a phone number, you would need to use say-as interpret-as="telephone".
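A minimal Python sketch of a `say-as` builder that only accepts the interpret-as values listed above. The function name `say_as` and its `date_format` parameter are hypothetical; the date and telephone inputs are the ones used in the text.

```python
from typing import Optional
from xml.sax.saxutils import escape

# The interpret-as values described in the text.
INTERPRET_AS = {
    "characters", "spell-out", "cardinal", "number", "ordinal", "digits",
    "fraction", "unit", "date", "time", "telephone", "address",
    "interjection", "expletive",
}


def say_as(text: str, interpret_as: str, date_format: Optional[str] = None) -> str:
    """Wrap text in a say-as tag; format is only allowed together with date."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as}")
    if date_format is not None and interpret_as != "date":
        raise ValueError("format is only used when interpret-as is date")
    fmt = f' format="{date_format}"' if date_format else ""
    return f'<say-as interpret-as="{interpret_as}"{fmt}>{escape(text)}</say-as>'


if __name__ == "__main__":
    print(say_as("????0922", "date"))
    print(say_as("2025551212", "telephone"))
```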
To include a speechcon in your skill’s text-to-speech response, use the <say-as interpret-as="interjection"> SSML tag:
Be sure to surround each speechcon with a pause. You can use punctuation (such as a period or comma) or other SSML tags (for instance, <break> or <s>) for pauses.
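As a sketch of the advice above, the following Python helper emits a speechcon followed by a pause. The names `speechcon` and `respond` are hypothetical, and `SAMPLE_SPEECHCONS` holds only a handful of names from the speechcon list, not the full set.

```python
from xml.sax.saxutils import escape

# A few names copied from the speechcon list (not the full set).
SAMPLE_SPEECHCONS = {"all righty", "bon voyage", "oh snap", "well done", "woo hoo"}


def speechcon(name: str) -> str:
    """Wrap a known speechcon name in an interjection say-as tag."""
    if name not in SAMPLE_SPEECHCONS:
        raise ValueError(f"not in the sample speechcon set: {name}")
    return f'<say-as interpret-as="interjection">{escape(name)}</say-as>'


def respond(name: str, sentence: str) -> str:
    """Speak a speechcon, pause, then speak a regular sentence."""
    return (
        f"<speak>{speechcon(name)}"
        '<break strength="strong"/>'
        f"{escape(sentence)}</speak>"
    )


if __name__ == "__main__":
    print(respond("well done", "You answered every question correctly."))
```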
Speechcon Names – 183
- all righty
- as you wish
- au revoir
- aw man
- bada bing bada boom
- bah humbug
- batter up
- beep beep
- bon appetit
- bon voyage
- boo hoo
- cha ching
- cheer up
- choo choo
- click clack
- cock a doodle doo
- ding dong
- dot dot dot
- dun dun dun
- en gard
- fancy that
- giddy up
- good grief
- good luck
- good riddance
- great scott
- heads up
- hear hear
- hip hip hooray
- jeepers creepers
- jiminy cricket
- just kidding
- knock knock
- le sigh
- look out
- mamma mia
- man overboard
- mazel tov
- nanu nanu
- neener neener
- no way
- now now
- oh boy
- oh brother
- oh dear
- oh my
- oh snap
- okey dokey
- ooh la la
- open sesame
- read ’em and weep
- ruh roh
- spoiler alert
- ta da
- ta ta
- tee hee
- there there
- tick tick tick
- tsk tsk
- uh huh
- uh oh
- wah wah
- watch out
- way to go
- well done
- well well
- whoops a daisy
- woo hoo
- yadda yadda yadda
- yoo hoo
- you bet
Polly supports SSML tags with two extensions: breaths and voice effects.
The self-closing amazon:breath tag instructs the artificial speaker to take a (fairly life-like) breath of a specified length and volume.
Voice effects include whispering, speaking softly and changing the vocal tract length to make the speaker sound bigger or smaller.
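A minimal Python sketch of the two Polly extensions. The helper names are hypothetical, and the attribute values accepted here (`duration`, `volume`, `name="whispered"`, `phonation="soft"`) reflect the Polly SSML reference as best recalled, so they are worth checking against the current documentation before use.

```python
from xml.sax.saxutils import escape

# Values assumed from the Polly SSML reference; verify before relying on them.
BREATH_DURATIONS = {"default", "x-short", "short", "medium", "long", "x-long"}
BREATH_VOLUMES = {"default", "x-soft", "soft", "medium", "loud", "x-loud"}


def breath(duration: str = "medium", volume: str = "medium") -> str:
    """Return a self-closing breath tag of the given length and loudness."""
    if duration not in BREATH_DURATIONS or volume not in BREATH_VOLUMES:
        raise ValueError("unsupported breath duration or volume")
    return f'<amazon:breath duration="{duration}" volume="{volume}"/>'


def whisper(text: str) -> str:
    """Wrap text in the whispered voice effect."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def soft(text: str) -> str:
    """Wrap text in the soft phonation effect."""
    return f'<amazon:effect phonation="soft">{escape(text)}</amazon:effect>'


if __name__ == "__main__":
    print(f"<speak>{breath('long', 'loud')}I have a secret. {whisper('Do not tell anyone.')}</speak>")
```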
Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. SSML specifies a fair amount of markup for prosody. This includes markup for
- pitch range
It will no longer be good enough to get something published. Differentiation will come from standout quality.
SSML enables complete control over audio rendering. If you’re familiar with CSS for browser apps, SSML is the equivalent for audio apps. Imagine how poor the user experience for our favorite websites would be if they only rendered basic text. It’s the same for voice, which is why it is so important to learn SSML to further the audio experience.
Prosody concerns the elements of speech that are not individual vowels and consonants but properties of syllables and larger units of speech. These include intonation, tone, stress, and rhythm.
Prosody reflects various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus. It may otherwise reflect other elements of language that may not be encoded by grammar or by choice of vocabulary.
In studying the prosodic aspects of speech, it is usual to distinguish between auditory measures (subjective impressions produced in the mind of the listener) and acoustic measures (physical properties of the sound wave that may be measured objectively). Auditory and acoustic measures of prosody do not correspond in a linear way.
In auditory terms, the major variables are:
- the pitch of the voice (varying between low and high)
- length of sounds (varying between short and long)
- loudness, or prominence (varying between soft and loud)
- timbre (quality of sound)
In acoustic terms, these correspond reasonably closely to:
- fundamental frequency (measured in hertz, or cycles per second)
- duration (measured in time units such as milliseconds or seconds)
- intensity, or sound pressure level (measured in decibels)
- spectral characteristics (distribution of energy at different parts of the audible frequency range)
Different combinations of these variables are exploited in intonation and stress, as well as rhythm, tempo and loudness. Additional prosodic variables include voice quality and pausing.
Prosodic features are said to be suprasegmental, since they are properties of units of speech larger than the individual segment (though exceptionally it may happen that a single segment may constitute a syllable, and thus even a whole utterance, e.g. “Ah!”). It is necessary to distinguish between the personal, background characteristics that belong to an individual’s voice (for example their habitual pitch range) and the independently variable prosodic features that are used contrastively to communicate meaning (for example, the use of changes in pitch to indicate the difference between statements and questions).
English intonation is based on three aspects:
- The division of speech into units
- The highlighting of particular words and syllables
- The choice of pitch movement (e.g. fall or rise)
These are sometimes known as Tonality, Tonicity and Tone.
Speakers sometimes speak with a wide range of pitch (this is usually associated with excitement) and at other times with a narrow range. English has been said to make use of changes in key: shifting one’s intonation into the higher or lower part of one’s pitch range is believed to be meaningful in certain contexts.
Stress makes a syllable prominent. Stress may be studied in relation to individual words (named “word stress” or lexical stress) or in relation to larger units of speech (traditionally referred to as “sentence stress” but more appropriately named “prosodic stress”). Stressed syllables are made prominent by several variables, by themselves or in combination.
Stress is associated with the following:
- pitch prominence, that is, a pitch level that is different from that of neighbouring syllables, or a pitch movement.
- increased length (duration).
- increased loudness (dynamics).
- differences in timbre: in English and some other languages, stress is associated with aspects of vowel quality (whose acoustic correlate is the formant frequencies or spectrum of the vowel). Unstressed vowels tend to be centralized relative to stressed vowels, which are normally more peripheral in quality.
These cues to stress are not equally powerful. Pitch, length and loudness form a scale of importance in bringing syllables into prominence, with pitch the most efficacious and loudness the least.
When pitch prominence is the major factor, the resulting prominence is often called accent rather than stress.
There is considerable variation from language to language concerning the role of stress in identifying words or in interpreting grammar and syntax.
Speech tempo is a measure of the number of speech units of a given type produced within a given amount of time. Speech tempo is believed to vary within the speech of one person according to contextual and emotional factors, between speakers and also between different languages and dialects. However, there are many problems involved in investigating this variance scientifically.
Measurements of speech tempo can be strongly affected by pauses and hesitations. For this reason, it is usual to distinguish between speech tempo including pauses and hesitations and speech tempo excluding them. The former is called speaking rate and the latter articulation rate.
One measure is sounds per second. Rates vary from an average of 9.4 sounds per second for poetry reading to 13.83 per second for sports commentary.
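The difference between speaking rate (pauses included) and articulation rate (pauses excluded) is simple arithmetic, sketched here in Python. The function names and the sample figures (120 sounds, 12 seconds, 2 seconds of pauses) are made up for illustration.

```python
def speaking_rate(units: int, total_seconds: float) -> float:
    """Units per second over the whole utterance, pauses included."""
    return units / total_seconds


def articulation_rate(units: int, total_seconds: float, pause_seconds: float) -> float:
    """Units per second over the time actually spent articulating."""
    speaking_time = total_seconds - pause_seconds
    if speaking_time <= 0:
        raise ValueError("pauses cannot fill the whole utterance")
    return units / speaking_time


if __name__ == "__main__":
    # Hypothetical utterance: 120 sounds in 12 seconds, 2 of them silent.
    print(speaking_rate(120, 12.0))           # 10.0
    print(articulation_rate(120, 12.0, 2.0))  # 12.0
```

The articulation rate is always at least as high as the speaking rate, since removing pauses can only shorten the time in the denominator.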
Monosyllables may be pronounced as “clipped”, “drawled” or “held” and polysyllabic utterances may be spoken at “allegro”, “allegrissimo”, “lento” and “lentissimo”.
The widespread view that some languages are spoken more rapidly than others is an illusion. This illusion is related to differences of rhythm and pausing.
Although rhythm is not a prosodic variable in the way that pitch or loudness are, it is usual to treat a language’s characteristic rhythm as a part of its prosodic phonology. It has often been asserted that languages exhibit regularity in the timing of successive units of speech, a regularity referred to as isochrony, and that every language may be assigned one of three rhythmical types: stress-timed (where the durations of the intervals between stressed syllables is relatively constant), syllable-timed (where the durations of successive syllables are relatively constant) and mora-timed (where the durations of successive morae are relatively constant).
Voiced or unvoiced, the pause is a form of interruption to articulatory continuity. Conversation analysis commonly notes pause length. Distinguishing auditory hesitation from silent pauses is one challenge. Contrasting junctures within and without word chunks can aid in identifying pauses.
There are a variety of “filled” pause types. Formulaic language pause fillers include “Like”, “Er” and “Uhm”, and paralinguistic expressive respiratory pauses include the sigh and gasp.
Although related to breathing, pauses may contain contrastive linguistic content, as in the periods between individual words in English advertising voice-over copy sometimes placed to denote high information content, e.g. “Quality. Service. Value.”
Pausing or its lack contributes to the perception of word groups, or chunks. Chunks commonly highlight lexical items or fixed expression idioms. The well-known English chunk “Know what I mean?” sounds like a single word (“No-whuta-meen?”) due to blurring or rushing the articulation of adjacent word syllables, thereby changing the potential open junctures between words into closed junctures.
Intonation is said to have a number of perceptually significant functions in English and other languages, contributing to the recognition and comprehension of speech.
The sentence “They invited Bob and Bill and Al got rejected” is ambiguous when written, although addition of a written comma after either “Bob” or “Bill” will remove the sentence’s ambiguity. But when the sentence is read aloud, prosodic cues like pauses (dividing the sentence into chunks) and changes in intonation will reduce or remove the ambiguity. Moving the intonational boundary in cases such as the above example will tend to change the interpretation of the sentence.
Intonation and stress work together to highlight important words or syllables for contrast and focus. A well-known example is the ambiguous sentence “I never said she stole my money”, where there are seven meaning changes depending on which of the seven words is vocally highlighted.
Prosody plays a role in the regulation of conversational interaction and in signaling discourse structure indicating whether information is new or already established; whether a speaker is dominant or not in a conversation; and when a speaker is inviting the listener to make a contribution to the conversation.
Prosody is also important in signalling emotions and attitudes. When this is involuntary (as when the voice is affected by anxiety or fear), the prosodic information is not linguistically significant. However, when the speaker varies her speech intentionally, for example to indicate sarcasm, this usually involves the use of prosodic features. The most useful prosodic feature in detecting sarcasm is a reduction in the mean fundamental frequency relative to other speech for humor, neutrality, or sincerity. While prosodic cues are important in indicating sarcasm, context clues and shared knowledge are also important.
Native speakers listening to actors reading emotionally neutral text while projecting emotions correctly recognized happiness 62% of the time, anger 95%, surprise 91%, sadness 81%, and neutral tone 76%. When a database of this speech was processed by computer, segmental features allowed better than 90% recognition of happiness and anger, while suprasegmental prosodic features allowed only 44%–49% recognition. The reverse was true for surprise, which was recognized only 69% of the time by segmental features and 96% of the time by suprasegmental prosody.
In typical conversation (no actor voice involved), the recognition of emotion may be quite low, of the order of 50%, hampering the complex interrelationship function of speech advocated by some authors. However, even if emotional expression through prosody cannot always be consciously recognized, tone of voice may continue to have subconscious effects in conversation. This sort of expression does not stem from linguistic or semantic effects, and can thus be isolated from traditional linguistic content. The ability of the average person to decode the conversational implicature of emotional prosody has been found to be slightly less accurate than traditional facial expression discrimination ability; however, the specific ability to decode varies by emotion. These emotional cues have been determined to be ubiquitous across cultures, as they are utilized and understood across cultures. Various emotions, and their general experimental identification rates, are as follows:
- Anger and sadness: High rate of accurate identification
- Fear and happiness: Medium rate of accurate identification
- Disgust: Poor rate of accurate identification
The prosody of an utterance is used by listeners to guide decisions about the emotional affect of the situation. Whether a person decodes the prosody as positive, negative, or neutral plays a role in the way a person decodes a facial expression accompanying an utterance. As the facial expression becomes closer to neutral, the prosodic interpretation influences the interpretation of the facial expression.
Unique prosodic features have been noted in infant-directed speech (IDS) – also known as baby talk, child-directed speech (CDS), or motherese. Adults, especially caregivers, speaking to young children tend to imitate childlike speech by using higher and more variable pitch, as well as an exaggerated stress. These prosodic characteristics are thought to assist children in acquiring phonemes, segmenting words, and recognizing phrasal boundaries. And though there is no evidence to indicate that infant-directed speech is necessary for language acquisition, these specific prosodic features have been observed in many different languages.
IBM – Cloud
Express-as Attributes – GoodNews, Apology, or Uncertainty.
By default, the IBM Text to Speech service synthesizes text in a neutral declarative style. The service extends SSML with an <express-as> element that produces expressiveness by converting text to synthesized speech in various speaking styles. The element is analogous to the SSML element <say-as>, which specifies text normalization for formatted text such as dates, times, and numbers.
GoodNews expresses a positive, upbeat message.
Apology expresses a message of regret.
Uncertainty conveys an uncertain, interrogative message.
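A minimal Python sketch of the three styles. The helper name `express_as` and the sample sentence are hypothetical; the `type` attribute on `<express-as>` is the one used in IBM's documentation for this extension.

```python
from xml.sax.saxutils import escape

# The three speaking styles described above.
STYLES = {"GoodNews", "Apology", "Uncertainty"}


def express_as(text: str, style: str) -> str:
    """Wrap text in an express-as element with one of the supported styles."""
    if style not in STYLES:
        raise ValueError(f"unsupported style: {style}")
    return f'<express-as type="{style}">{escape(text)}</express-as>'


if __name__ == "__main__":
    print(f"<speak>{express_as('Your order shipped today!', 'GoodNews')}</speak>")
```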