[ELI5] Why does autocorrect insist that the first letter of a misspelled word is more important than the rest of it?


For example, if I spell “umportant”, it’s easy for us to recognise that it’s supposed to be “important”, but autocorrect insists that it’s something like “umbrella”, or I guess more logically “unimportant”, even though “important” is only 1 correction away.

These are real examples from my phone (Samsung Galaxy):

Wuick gets the suggestions Wicked, Which, Wucky, Whickham, Whicker, Wick, Wickets, Wicket, and Wickham. None of which are “Quick”, what I intended to write.

Nrown gets the suggestions Now, Nr own, Noon, Nowhere, Nr owner, Nr owns, and Nr owners. None of which are “Brown”.

Dence gets the suggestions Dance, December, Denied, Dancers, Decent, Dense, Dench, and Deuce. None of which are “Fence”.

It’s bothered me for years that it never ever picks up on a misspelt first letter.

Edit: I tried “umportant”, and it actually comes with 0 suggestions. Not umbrella, not unimportant, not even “important”. But “inportant” and “ikportant” and even “iqportant” are all recognised as “important”.


14 Answers

Anonymous

I don’t work on these models, so I’m not totally clear on whether they weight probabilities based on similarities at the beginning of the word, but I do know in general terms how models like this work (software engineer, had some projects about this sort of thing during school, though that was a few years ago!).

In theory, the model applies four or so kinds of edit (inserting a letter that was missed, removing a letter that was mistakenly added, swapping two adjacent letters, or replacing a letter) at different positions in a string it has flagged as an “error.” A typo can then be said to be *n* operations away from a valid word. That word isn’t necessarily the intended one, though: for example, *knto* is 1 operation away from both *into* and *unto*. This is usually formalized as the Levenshtein distance: the number of single-character edits required to change one string into another. (Strictly, counting a swap of two adjacent letters as a single edit makes it the Damerau-Levenshtein variant.)
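To make the "*n* operations away" idea concrete, here is a minimal sketch of that distance in Python. It counts the four operations described above (insert, remove, replace, swap adjacent letters). The function name `edit_distance` is my own choice for illustration, and I'm not claiming any real keyboard computes it exactly this way.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b.

    Counts insertions, deletions, replacements and swaps of two
    adjacent letters (the "optimal string alignment" flavour of
    Damerau-Levenshtein distance).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a
    # and the first j letters of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # insert all j letters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # remove a letter
                d[i][j - 1] + 1,         # insert a letter
                d[i - 1][j - 1] + cost,  # replace a letter (or keep it)
            )
            # Swap of two adjacent letters, e.g. "teh" -> "the".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("knto", "into"))   # 1
print(edit_distance("knto", "unto"))   # 1
print(edit_distance("wuick", "quick")) # 1
```

By this measure a wrong first letter costs exactly as much as a wrong letter anywhere else, so the distance on its own doesn't explain the behaviour in the question.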

Once your phone has generated a list of possible intended words (maybe sorted by Levenshtein distance from the error string), it’ll try to pick the ones that seem probable in context. Depending on the model, this can be a plain “likelihood-in-corpus” check (*unto* is used a lot less frequently in our magic book of words, so we guess it’s probably *into* instead) or a slightly more complex n-gram analysis of the surrounding words (“o lord send knto me your grace” is probably supposed to be *unto*, but “I’m knto cheese and also love wine” probably isn’t).
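Here's a rough sketch of that two-stage idea: generate every string one edit away from the typo, keep the ones that are real words, then rank by context first and overall frequency second. Everything named here is an assumption for illustration. The counts in `WORD_COUNTS` and `BIGRAM_COUNTS` are made up, and `one_edit_away` and `suggest` are hypothetical helpers, not anything from a real keyboard.

```python
import string

# Made-up counts standing in for a real corpus.
WORD_COUNTS = {"into": 900, "onto": 150, "knot": 60, "unto": 40}
# Made-up counts of (previous word, word) pairs.
BIGRAM_COUNTS = {("send", "unto"): 5, ("i'm", "into"): 30}


def one_edit_away(word):
    """All strings reachable from word with a single edit."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def suggest(typo, previous_word=None):
    """Known words one edit from typo, most plausible first."""
    candidates = [w for w in one_edit_away(typo) if w in WORD_COUNTS]

    def score(word):
        # Context (bigram count) outranks plain word frequency.
        context = BIGRAM_COUNTS.get((previous_word, word), 0)
        return (context, WORD_COUNTS[word])

    return sorted(candidates, key=score, reverse=True)


print(suggest("knto"))          # ['into', 'onto', 'knot', 'unto']
print(suggest("knto", "send"))  # ['unto', 'into', 'onto', 'knot']
```

With no context the common word wins; with "send" in front, the rarer *unto* jumps to the top.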

It also may not be that the code intentionally emphasizes the beginning of the word. These big lists of possibilities can be computationally expensive on such a small device, where performance is vital, so there may be optimization techniques that end up discarding edits to the beginning of words as a side effect. With big companies especially, it’s tough to know what changes they’ve made to the probability weighting, or what optimizations they’ve made for the limitations of specific device hardware, et cetera. Unless they came out and told us, it’d be pretty hard to get a conclusive answer to this.
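As a purely hypothetical illustration of how a speed-up could cause this (I have no idea whether any real keyboard does it), suppose the dictionary is bucketed by first letter and only the bucket matching the typo's first letter is searched. That's cheap, but a word with a wrong first letter can never find its correction. The tiny `DICTIONARY`, the bucket scheme, and the function names are all assumptions of mine.

```python
from collections import defaultdict

# Tiny stand-in dictionary; a real one has many thousands of words.
DICTIONARY = ["quick", "wick", "wicked", "brown",
              "fence", "dance", "dense", "important"]


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete
                           cur[j - 1] + 1,            # insert
                           prev[j - 1] + (ca != cb))) # replace
        prev = cur
    return prev[-1]


# Hypothetical optimization: index words by their first letter.
BY_FIRST_LETTER = defaultdict(list)
for word in DICTIONARY:
    BY_FIRST_LETTER[word[0]].append(word)


def suggest_pruned(typo, max_distance=2):
    """Only compares against words sharing the typo's first letter."""
    bucket = BY_FIRST_LETTER.get(typo[0], [])
    return [w for w in bucket if levenshtein(typo, w) <= max_distance]


def suggest_full(typo, max_distance=2):
    """Compares against the whole dictionary (slower, but complete)."""
    return [w for w in DICTIONARY if levenshtein(typo, w) <= max_distance]


print(suggest_pruned("wuick"))  # ['wick']
print(suggest_full("wuick"))    # ['quick', 'wick']
print(suggest_pruned("nrown"))  # []
print(suggest_full("nrown"))    # ['brown']
```

The pruned version only ever looks at a fraction of the dictionary, which is the appeal, and the cost is exactly the blind spot described in the question.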
