English spelling isn’t phonetic, but text-to-speech has to turn each letter or syllable into a specific sound (each sound is called a phoneme). So there is usually a dictionary, a set of rules, or both, to determine how to pronounce a word. Rules alone can’t get everything right: compare “laughter” vs “daughter”, or “polish” vs “Polish”. If a word was pronounced incorrectly, it probably wasn’t in the dictionary, and the system was falling back on the rules to guess.
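Here is a rough sketch of that dictionary-then-rules idea in Python. Everything in it is made up for illustration: the tiny `DICTIONARY`, the toy `RULES` table, and the ARPAbet-style phoneme names are assumptions, not taken from SAM or any real text-to-speech engine.

```python
# Hypothetical pronunciation dictionary: word -> list of phonemes.
# The phoneme symbols are ARPAbet-style and chosen only for illustration.
DICTIONARY = {
    "laughter": ["L", "AE", "F", "T", "ER"],
    "daughter": ["D", "AO", "T", "ER"],
}

# Toy spelling-to-sound rules, longest patterns first.
# A real rule set would be far larger and context-sensitive.
RULES = [
    ("augh", ["AO"]),
    ("er", ["ER"]),
    ("d", ["D"]),
    ("l", ["L"]),
    ("s", ["S"]),
    ("t", ["T"]),
]


def apply_rules(word):
    """Guess a pronunciation by matching spelling patterns left to right."""
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, sounds in RULES:
            if word.startswith(pattern, i):
                phonemes.extend(sounds)
                i += len(pattern)
                break
        else:
            # No rule for this letter in the toy table: skip it.
            i += 1
    return phonemes


def pronounce(word):
    """Use the dictionary when possible, otherwise fall back to the rules."""
    key = word.lower()  # note: this loses the polish/Polish distinction
    if key in DICTIONARY:
        return DICTIONARY[key]
    return apply_rules(key)
```

With these toy tables, `pronounce("slaughter")` is not in the dictionary, so the rules guess S-L-AO-T-ER, which happens to be reasonable. But `apply_rules("laughter")` gives L-AO-T-ER (rhyming with "daughter"), which is exactly the kind of mistake the dictionary entry is there to prevent.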
Some of the improvement is simply better phoneme sounds. Some early computers couldn’t play back digital samples, and they didn’t have much storage or any internet access, so they couldn’t keep recordings of every word. They had to approximate the sounds using whatever the sound chip could do. (Commodore 64 SAM: https://www.youtube.com/watch?v=B4_fjy9b7WI, and the rules: https://github.com/DLehenbauer/c64-sam/blob/main/src/sam.s) Newer text-to-speech uses digital samples of phonemes or even of entire words.