ELI5: Why did Microsoft Sam, the old text-to-speech voice, pronounce some words so wrong?


I’ve read that the infamous “soi” bug happened because the engine interpreted the word as French, but some other words were also way off. Most notably, “crotch” was pronounced as “crow’s nest”. What caused this, and what has changed in the development of text-to-speech to make voices better?


2 Answers

Anonymous 0 Comments

English spelling isn’t phonetic: the same letters can stand for different sounds. Text-to-speech, however, has to pick a specific sound (called a phoneme) for each letter or group of letters. So there is usually a pronunciation dictionary, a set of rules, or both, to determine how to pronounce a word. Compare “laughter” vs. “daughter”, or “polish” vs. “Polish”: near-identical spellings, different sounds. If a word was pronounced incorrectly, it probably wasn’t in the dictionary, and the engine was trying to pronounce it using the rules.
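To make the dictionary-then-rules idea concrete, here is a rough sketch in Python. It is not how Microsoft Sam actually worked. The word list, the rules, the phoneme symbols, and the function name `to_phonemes` are all made up for illustration.

```python
# Hypothetical mini-dictionary of exceptions: spelling -> phonemes.
# The phoneme symbols are illustrative, loosely ARPAbet-style.
LEXICON = {
    "laughter": ["L", "AE", "F", "T", "ER"],
}

# Toy letter-to-sound rules, listed with longer spellings first so that
# "augh" is tried before any single letter. A real engine would have
# many more rules, and they would depend on context.
RULES = [
    ("augh", ["AO"]),
    ("er", ["ER"]),
    ("ng", ["NG"]),
    ("d", ["D"]),
    ("l", ["L"]),
    ("t", ["T"]),
    ("s", ["S"]),
    ("i", ["IH"]),
]


def to_phonemes(word):
    """Look the word up in the dictionary; fall back to the rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])

    phonemes = []
    i = 0
    while i < len(word):
        for spelling, sounds in RULES:
            if word.startswith(spelling, i):
                phonemes.extend(sounds)
                i += len(spelling)
                break
        else:
            i += 1  # no rule covers this letter, so skip it
    return phonemes


print(to_phonemes("laughter"))  # found in the dictionary
print(to_phonemes("daughter"))  # rules happen to get this one right
print(to_phonemes("laughing"))  # rules guess "augh" wrong: "law-ing"
```

In this toy setup, “laughing” is not in the dictionary, so the rules treat “augh” the same way they do in “daughter” and the result comes out wrong. That is the same kind of failure as the one described above, just on a much smaller scale.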

Some of the improvement is simply better phoneme sounds. Some early computers couldn’t play back digital samples, and they didn’t have much storage or any access to the internet, so they couldn’t store recordings of every word. They had to approximate the sounds using the capabilities of the sound chip. (Commodore 64 SAM: https://www.youtube.com/watch?v=B4_fjy9b7WI, and the rules: https://github.com/DLehenbauer/c64-sam/blob/main/src/sam.s) Newer text-to-speech uses digital samples for phonemes or even entire words.
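The “digital samples for phonemes” approach can be pictured as gluing short recordings together. The sketch below assumes a hypothetical table `PHONEME_SAMPLES`, and it uses plain sine tones as stand-ins for real recordings, so it will not sound like speech. It only shows the stitching step.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an arbitrary choice for this sketch)


def tone(freq_hz, n_samples=800):
    """Placeholder 'recording': a sine tone instead of a real phoneme."""
    return [
        math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
        for t in range(n_samples)
    ]


# Hypothetical sample bank: one short clip per phoneme.
PHONEME_SAMPLES = {
    "D": tone(200),
    "AO": tone(300),
    "T": tone(400),
    "ER": tone(500),
}


def synthesize(phonemes):
    """Concatenate the stored clip for each phoneme into one waveform."""
    audio = []
    for phoneme in phonemes:
        audio.extend(PHONEME_SAMPLES[phoneme])
    return audio


audio = synthesize(["D", "AO", "T", "ER"])
print(len(audio), "samples")
```

Real systems also smooth the joins between clips and adjust pitch and timing, which this sketch leaves out.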
