Optical character recognition (OCR) is the extraction of text from images, so you don't have to retype it by hand.
For this website, I used ABBYY FineReader. I also tried OmniPage and several online services, including the OCR function in Google Keep. ABBYY was definitely the best, at least for Japanese text.
Alas, not even ABBYY was flawless, despite the high quality of my scans. I would say it was about 99.5% accurate. Fortunately, ABBYY has an extremely useful feature that highlights characters it wasn't sure of by giving them a sky-blue background. However, I also spotted some incorrect characters that were not highlighted, so this feature is not perfect either.
If ABBYY marked characters as 'not sure', you can right-click them and select the correct character. Sometimes the correct character isn't among the suggestions, so you'll have to correct it yourself. You can do this by rescanning (re-OCRing) those specific characters. There are two ways to do this:
- Load the image a second time in ABBYY and draw a very small rectangle around those specific characters. If they are recognized correctly this time, copy and paste them into the text of the original image. This often worked, but not when the Japanese characters were attached to each other.
- If the Japanese characters are too close or attached to each other: open the picture in MS Paint or some other image editor. Cut out the attached characters and 'pull them apart', then rescan them.
In the case of kana (i.e. hiragana and katakana), the above wasn't necessary, because I could just type the right character myself. However, I didn't know (and still don't know) how to type kanji.
To OCR specific kanji, I sometimes also used the free program KanjiTomo. It recognized certain kanji which ABBYY could not. If not even KanjiTomo could recognize it, I searched the online dictionary on tanoshiijapanese.com as a last resort.
Reasons for mistakes
As mentioned above, ABBYY had difficulty recognizing characters that were too close together or attached to each other.
Another reason was that certain characters in Japanese resemble each other. See this article for more information on this.
In particular, ABBYY FineReader had a nasty habit of mixing up the chōonpu (ー) and the dash (—), resulting in this:
was translated by Google Translate (GT) as “Pokemon”, while
was translated by GT as “Pokemon Star”.
Fortunately, ABBYY did mark most of these vertical lines as 'unsure', so you can usually correct them manually.
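Mistakes like these could also be caught with a small script once the text has been exported. The following is only a rough sketch in Python, not something ABBYY does itself. The function names, the list of dash characters, and the table of lookalike pairs are all assumptions made for this example. It replaces a single dash that directly follows a kana with the chōonpu, and it flags a few kanji that resemble katakana when they appear next to katakana.

```python
import re

CHOONPU = "\u30FC"  # the long-vowel mark

# Hiragana, katakana, and the chōonpu itself: the mark normally follows kana.
KANA = "\u3041-\u3096\u30A1-\u30FA\u30FC"

# Dash-like characters that OCR might produce instead (assumed list):
# em dash, horizontal bar, en dash, minus sign, fullwidth hyphen, ASCII hyphen.
DASHES = "\u2014\u2015\u2013\u2212\uFF0D\\-"

# A single dash right after a kana. Runs of two or more dashes are left
# alone, because a doubled dash can be genuine punctuation.
_DASH_AFTER_KANA = re.compile(f"(?<=[{KANA}])[{DASHES}](?![{DASHES}])")

# Kanji that look like a katakana (assumed, incomplete table).
LOOKALIKES = {
    "\u529B": "\u30AB",  # kanji "power"   vs katakana ka
    "\u53E3": "\u30ED",  # kanji "mouth"   vs katakana ro
    "\u4E8C": "\u30CB",  # kanji "two"     vs katakana ni
    "\u5DE5": "\u30A8",  # kanji "craft"   vs katakana e
    "\u5915": "\u30BF",  # kanji "evening" vs katakana ta
    "\u516B": "\u30CF",  # kanji "eight"   vs katakana ha
}


def fix_choonpu(text):
    """Replace a lone dash that follows a kana with the chōonpu."""
    return _DASH_AFTER_KANA.sub(CHOONPU, text)


def is_katakana(ch):
    """True for katakana letters and the chōonpu."""
    return "\u30A1" <= ch <= "\u30FA" or ch == CHOONPU


def flag_lookalikes(text):
    """Return (index, found, suggestion) for suspicious kanji beside katakana."""
    hits = []
    for i, ch in enumerate(text):
        if ch not in LOOKALIKES:
            continue
        prev_is_kata = i > 0 and is_katakana(text[i - 1])
        next_is_kata = i + 1 < len(text) and is_katakana(text[i + 1])
        if prev_is_kata or next_is_kata:
            hits.append((i, ch, LOOKALIKES[ch]))
    return hits
```

Both functions only make guesses. A dash after a kana is sometimes correct, and a kanji next to katakana can be intentional, so the results still need to be checked against the scan.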