AnswerCult

Question

477 views · January 1, 2024

Question 100.55K May 4, 2023 0 Comments

I’ve read that the infamous “soi” bug happened because the engine interpreted the word as French, but some other words were also way off. Most notably, “crotch” would be pronounced as “Crows Nest”. What caused this, and what has changed in the development of text-to-speech to make voices better?

2 Answers

Answer 1 · 2023-05-07T16:29:53+00:00

English spelling isn’t phonetic, but text-to-speech has to pick a specific sound for each letter or syllable (each sound is called a phoneme). So there is usually a dictionary, a set of rules, or both, to determine how to pronounce a word. Compare “laughter” with “daughter”, or “polish” with “Polish”. If a word was pronounced incorrectly, it probably wasn’t in the dictionary, and the engine was trying to pronounce it using the rules.
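
To make the dictionary-then-rules idea concrete, here is a rough Python sketch. Everything in it is made up for illustration: the `LEXICON` entries, the ARPAbet-style phoneme symbols, and the toy spelling rules are assumptions, not what Microsoft Sam or any real engine actually used.

```python
# Hypothetical pronunciation dictionary (ARPAbet-style symbols, illustrative only).
LEXICON = {
    "daughter": ["D", "AO", "T", "ER"],
    "polish": ["P", "AA", "L", "IH", "SH"],   # to shine something
    "Polish": ["P", "OW", "L", "IH", "SH"],   # from Poland
}

# Toy multi-letter rules, checked before single letters.
MULTI_RULES = [("augh", "AO"), ("ch", "CH"), ("sh", "SH"), ("th", "TH"), ("er", "ER")]

# Toy one-letter fallbacks; real rule sets are context-sensitive and much larger.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}

def rule_based(word):
    """Guess phonemes from spelling alone, scanning left to right."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, sound in MULTI_RULES:
            if word.startswith(pattern, i):
                phonemes.append(sound)
                i += len(pattern)
                break
        else:
            # No multi-letter rule matched: fall back to the single letter.
            sound = LETTER_SOUNDS.get(word[i])
            if sound:
                phonemes.extend(sound.split())
            i += 1
    return phonemes

def to_phonemes(word):
    """Dictionary first (exact case, then lowercase), rules as a last resort."""
    if word in LEXICON:
        return LEXICON[word]
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    return rule_based(word)

print(to_phonemes("daughter"))  # found in the dictionary
print(to_phonemes("laughter"))  # not in the dictionary, so the rules guess
```

Because “laughter” is missing from this toy dictionary, the “augh” rule that suits “daughter” gets applied to it too, and it comes out rhyming with “daughter”. That is the same kind of failure as a word falling through to the rules in a real engine.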

Some of the improvement is simply better phoneme sounds. Some early computers couldn’t play back digital samples, and they had little storage and no internet access, so they couldn’t keep recordings of every word. They had to approximate the sounds using the capabilities of the sound chip. (Commodore 64 SAM: https://www.youtube.com/watch?v=B4_fjy9b7WI, and the rules: https://github.com/DLehenbauer/c64-sam/blob/main/src/sam.s) Newer text-to-speech uses digital samples for phonemes or even entire words.
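
The “digital samples for phonemes” approach boils down to looking up a stored clip per phoneme and joining the clips. A minimal Python sketch, assuming a made-up `UNITS` table where short sine tones stand in for recorded speech (the sample rate and frequencies are arbitrary):

```python
import math

SAMPLE_RATE = 8000  # samples per second; arbitrary choice for this sketch

def placeholder_unit(freq_hz, duration_ms=100):
    """Stand-in for a recorded phoneme: a short sine tone as a list of floats."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]

# Hypothetical unit inventory. A real voice would load recorded speech here.
UNITS = {
    "D": placeholder_unit(200),
    "AO": placeholder_unit(300),
    "T": placeholder_unit(400),
    "ER": placeholder_unit(500),
}

def synthesize(phonemes):
    """Join the stored unit for each phoneme end to end."""
    samples = []
    for p in phonemes:
        samples.extend(UNITS[p])
    return samples

audio = synthesize(["D", "AO", "T", "ER"])
print(len(audio), "samples,", len(audio) / SAMPLE_RATE, "seconds")
```

Joining clips this bluntly is part of why older voices sound choppy. Smoothing the joins and using longer recorded units, up to whole words, makes the result sound more natural.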

eli5 Why did Microsoft Sam, the old Text to Speech voice, pronounce some words so wrong?
