[ELI5] Why does autocorrect insist that the first letter of a misspelled word is more important than the rest of it?

For example, if I spell “umportant”, it’s easy for us to recognise that it’s supposed to be “important”, but autocorrect insists that it’s something like “umbrella”, or I guess more logically “unimportant”, even though “important” is only 1 correction away.

These are real examples from my phone (Samsung Galaxy):

Wuick gets the suggestions Wicked, Which, Wucky, Whickham, Whicker, Wick, Wickets, Wicket, and Wickham. None of which are “Quick”, what I intended to write.

Nrown gets the suggestions Now, Nr own, Noon, Nowhere, Nr owner, Nr owns, and Nr owners. None of which are “Brown”.

Dence gets the suggestions Dance, December, Denied, Dancers, Decent, Dense, Dench, and Deuce. None of which are “Fence”.

It’s bothered me for years that it never ever picks up on a misspelt first letter.

Edit: I tried “umportant”, and it actually comes up with 0 suggestions. Not umbrella, not unimportant, not even “important”. But “inportant” and “ikportant” and even “iqportant” are all recognised as “important”.

14 Answers

Anonymous 0 Comments

I don’t work on these models, so I’m not totally clear on whether they weight probabilities based on similarities at the beginning of the word, but I do know in general terms how models like this work (software engineer, had some projects about this sort of thing during school, though that was a few years ago!).

Theoretically, the model is going to try to apply four or so methods (inserting a letter that was missed, removing a letter that was mistakenly added, exchanging two adjacent letters, or replacing a letter) to different locations in a substring identified as an “error.” As such, a typo can be said to be *n* operations away from a valid word. The closest valid word may not be the intended word though: for example, *knto* is 1 operation away from both *into* and *unto*. This is commonly measured with a metric called Levenshtein distance: the number of single-character insertions, deletions, or substitutions required to change one string into another. (The variant that also counts swapping two adjacent letters as a single edit is usually called Damerau-Levenshtein distance.)
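To make the "operations away" idea concrete, here is a minimal sketch of an edit distance that counts the four operations listed above. It is the textbook "optimal string alignment" version, not anything taken from a real phone keyboard, and the function name `edit_distance` is just one I picked for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the fewest single edits (insert, delete, replace, or swap of
    two adjacent letters) needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a and first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # remove a letter
                d[i][j - 1] + 1,         # insert a letter
                d[i - 1][j - 1] + cost,  # replace a letter (or keep it)
            )
            # swap two adjacent letters, e.g. "teh" -> "the"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("knto", "into"))            # 1
print(edit_distance("knto", "unto"))            # 1
print(edit_distance("umportant", "important"))  # 1
print(edit_distance("umportant", "unimportant"))  # 2
```

Note that by this measure a wrong first letter costs exactly as much as a wrong letter anywhere else, so any bias toward the start of the word would have to come from somewhere other than the distance itself.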

Once your phone generates a list of possible intended words (maybe sorting them by their Levenshtein distance to the error string), it’ll try to pick ones that seem probable in the context. Depending on the model, this can be a plain “likelihood-in-corpus” (*unto* is used a lot less frequently in our magic book of words, so we guess it’s probably *into* instead) or a slightly more complex n-gram analysis of the surrounding words (“o lord send knto me your grace” is probably supposed to be *unto*, but “I’m knto cheese and also love wine” probably isn’t).
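A rough sketch of that two-step idea (generate candidates one edit away, then rank them) might look like the following. The word counts and bigram counts are numbers I made up purely for illustration, and `edits1` and `suggest` are hypothetical helper names, not any vendor's API.

```python
import string

# Made-up counts standing in for a real corpus (assumption, illustration only).
WORD_COUNTS = {"into": 900, "unto": 40, "onto": 120}
# Made-up counts of (previous word, word) pairs.
BIGRAM_COUNTS = {("send", "unto"): 30, ("send", "into"): 2, ("i'm", "into"): 50}


def edits1(word):
    """Every string exactly one insert, delete, swap, or replace away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(typo, previous_word=None):
    """Rank known words one edit away: context first, then raw frequency."""
    candidates = [w for w in edits1(typo) if w in WORD_COUNTS]

    def score(word):
        context = BIGRAM_COUNTS.get((previous_word, word), 0)
        return (context, WORD_COUNTS[word])

    return sorted(candidates, key=score, reverse=True)


print(suggest("knto"))          # ['into', 'onto', 'unto']
print(suggest("knto", "send"))  # ['unto', 'into', 'onto']
```

With no context the most common word wins; once the previous word is taken into account, *unto* can jump to the top, which is the n-gram effect described above.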

It also may not be that the code is intentionally emphasizing the beginning of the word, but that there are optimization techniques (these big lists of possibilities can be a bit computationally expensive for such a small device where performance is vital) which somehow end up discarding edits that could be made to the beginning of words. With big companies especially, it’s tough to know what edits they’ve made to probability weighting or optimizations based on the limitations of the specific device hardware, et cetera. Unless they came out and told us, it’d be pretty hard to get a conclusive answer to this.

Anonymous 0 Comments

It’s because you generally wouldn’t keep typing if you’d already mistyped the first letter of a word, so autocorrect assumes the first letter is right and the mistake is somewhere in the following letters. On top of that, you should at least know the first letter of the word you’re trying to type.

Anonymous 0 Comments

I think it’s intended.

If you have the first and second letters, the number of possible words shrinks, which brings the search time down dramatically.

Finding words that start with a specific two letters is significantly faster than finding all words that begin with a specific letter plus all words that have a specific second letter. There’s no point in even having autocorrect if you could type out a chapter of a book before it displays the possible word.
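Here is a small sketch of what that trade-off could look like, assuming the dictionary is bucketed by its first two letters. Whether any real keyboard does this is a guess; the word list and the function names (`build_prefix_index`, `candidates_by_prefix`, `candidates_by_scan`) are made up for the example.

```python
from collections import defaultdict

# Tiny stand-in dictionary (assumption, illustration only).
WORDS = ["brown", "brow", "now", "noon", "nowhere",
         "quick", "wick", "wicked", "which"]


def build_prefix_index(words, prefix_len=2):
    """Group words by their first prefix_len letters for fast lookup."""
    index = defaultdict(list)
    for word in words:
        index[word[:prefix_len]].append(word)
    return index


INDEX = build_prefix_index(WORDS)


def candidates_by_prefix(typed):
    """Fast path: only look at the bucket matching the first two letters."""
    return INDEX.get(typed[:2], [])


def candidates_by_scan(typed):
    """Slow path: scan every word, allowing a wrong first letter."""
    return [w for w in WORDS
            if len(w) == len(typed) and w[1:] == typed[1:]]


print(candidates_by_prefix("brpwn"))  # ['brown', 'brow']
print(candidates_by_prefix("nrown"))  # []
print(candidates_by_scan("nrown"))    # ['brown']
```

The bucket lookup never sees "brown" when the typo is in the first letter, because it only ever opens the "nr" bucket. The full scan finds it, but it has to touch every word in the dictionary to do so.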

Anonymous 0 Comments

Many auto-correct systems and word suggesters use a data structure known as a trie that you can visualize as a tree. At the very top is an empty value (i.e., no characters). Then, it has a branch for every letter so that there’s “a”, “b,” … “z.” A branch then extends from every letter for all letters that could make a possible word. So A would have B and C and most if not all letters as branches extending from it. Q, on the other hand, would mainly lead to U.

Let’s say you start typing the beginning of “umbrella.” First, you’d traverse the trie from the empty value to the U. From there, you could find branches to other letters that might form a word. You could traverse to N because of words like “undo” or to Z because of words like “uzi.”

Now, let’s say you accidentally type “umport.” “Ump” is valid because it could lead to “umpire,” but I’m pretty sure that “umpo” doesn’t lead to any words, so it has no branches. It looks at the last possible valid path and returns the first word it finds, which might be umpire.
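The walk described above can be sketched with a toy trie. This is only an illustration under my own assumptions: a five-word dictionary, and "first word it finds" meaning the first word in alphabetical order below the last valid node. The names `TrieNode`, `insert`, `longest_valid_prefix`, `first_word_below`, and `suggest` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if the path to this node spells a word


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def longest_valid_prefix(root, typed):
    """Follow typed letters until there is no matching branch."""
    node, depth = root, 0
    for ch in typed:
        if ch not in node.children:
            break
        node = node.children[ch]
        depth += 1
    return node, typed[:depth]


def first_word_below(node, prefix):
    """Depth-first search (alphabetical) for the first complete word."""
    if node.is_word:
        return prefix
    for ch in sorted(node.children):
        found = first_word_below(node.children[ch], prefix + ch)
        if found:
            return found
    return None


def suggest(root, typed):
    node, prefix = longest_valid_prefix(root, typed)
    return first_word_below(node, prefix)


root = TrieNode()
for w in ["umbrella", "umpire", "undo", "uzi", "important"]:
    insert(root, w)

print(longest_valid_prefix(root, "umport")[1])  # ump
print(suggest(root, "umport"))                  # umpire
print(suggest(root, "umportant"))               # umpire
```

Because the walk always starts from the first typed letter, "important" lives under a completely different branch and is never even considered for "umportant" in this sketch.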

Each implementation could be different, leading to slightly different outcomes. Some don’t use tries at all, but many do.