For Latin-script languages, it is understandable that one might use OCR to load hardcopies into a database, but how are dictionaries for non-Latin-script languages (East Asian, Indian, and African languages, for example) converted to a database?
Some examples: [https://kosha.sanskrit.today/word/sa/prabhRti?q=%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AD%E0%A5%83%E0%A4%A4%E0%A4%BF](https://kosha.sanskrit.today/word/sa/prabhRti?q=%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AD%E0%A5%83%E0%A4%A4%E0%A4%BF)
[https://glosbe.com/](https://glosbe.com/)
OCR can in principle work on any script; systems already exist for most scripts, and even where they don’t, you can train a new one from scratch. (Although note that if you don’t have ANY other digital text in the language in question, it’s hard to correct the inevitable mistakes. You end up needing human error correction.)
But in practice, many online dictionary projects did something much easier than train a new OCR model: they just hired people to type it back in. Literate people are more plentiful and cheaper than people who know how to train a new OCR system. (And you need literate people anyway, to do the aforementioned error correction.)
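As a side note on why “just type it back in” works for any script: once a typist enters an entry, it is stored as ordinary Unicode, so a Devanagari headword is no different to a database than Latin text. A minimal Python sketch using the URL-encoded headword from the Sanskrit link above (the specific site and its URL scheme are just the example given; the decoding itself is standard):

```python
# The URL-encoded query string from the kosha.sanskrit.today example
# decodes to plain Unicode text -- exactly what a typist would enter.
from urllib.parse import unquote
import unicodedata

encoded = "%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AD%E0%A5%83%E0%A4%A4%E0%A4%BF"
word = unquote(encoded)
print(word)  # प्रभृति ("prabhRti" in Harvard-Kyoto transliteration)

# Each character is an ordinary Unicode code point; a database stores
# these the same way it stores Latin letters.
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

This is also why hand-keying scales: the hard part is finding literate speakers, not any script-specific technology.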
Also, as an educated guess, the majority of dictionaries for the world’s languages have been written in the last 40 years, when we’ve had computers. (Of the 7k languages in the world, most don’t have a very long tradition of books and literacy. A lot of languages got their first dictionaries, bible translations, textbooks, etc. only recently.) So when the dictionaries were made by computer in the first place, we don’t need to get them back into computers (except when someone loses the source files… which tbh is dismayingly often).