OCR

From: Peter (BOUGHTONP) 6 Apr 2019 00:30
To: ALL 1 of 11
I have a photo of a printed spreadsheet of names and numbers, which I figured would be a piece of piss to run through OCR and get back into useful data, but it turns out not. :(

Giving the image to Tesseract-OCR, it can detect the names fine (I can't see any errors in any of the letters), but for some of the numbers it's doing a ridiculously bad job. Even after processing the image to increase contrast and make things completely unambiguous, I still get "ns" in place of "118", or when the numbers are "0 124 97 221" it provides "Q 124 on eel".

Anyone know of OCR that can actually detect digits? Or even just something which behaves consistently - I can easily de-1337 if it simply mis-identifies - but not when "ee" might represent 77 or 84 or something else entirely! :@
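
For the "consistent mis-identification" case, a de-1337 pass can be a plain substitution table. The sketch below is only an illustration: the look-alike pairs in `LOOKALIKES` are guesses at common letter/digit confusions, not something taken from Tesseract's real output, and `de_leet` is a made-up helper name.

```python
from typing import Optional

# Hypothetical table of letters that an OCR engine might return in place of digits.
# These pairs are illustrative assumptions, not measured Tesseract behaviour.
LOOKALIKES = str.maketrans({
    "O": "0", "o": "0", "Q": "0", "D": "0",
    "l": "1", "I": "1", "|": "1",
    "Z": "2", "z": "2",
    "S": "5", "s": "5",
    "B": "8",
})

def de_leet(token: str) -> Optional[str]:
    """Swap look-alike letters for digits; return None if the result is still not a number."""
    fixed = token.translate(LOOKALIKES)
    return fixed if fixed.isdigit() else None

# Tokens quoted in the post above.
for tok in ["Q", "124", "2el", "ns"]:
    print(tok, "->", de_leet(tok))
```

This only rescues one-to-one confusions such as "Q" for "0". Tokens like "2el" or "ns" come back as `None`, which at least flags them for manual checking rather than guessing.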

From: graphitone 6 Apr 2019 09:24
To: Peter (BOUGHTONP) 2 of 11
Get typing.

From: ANT_THOMAS 6 Apr 2019 10:01
To: Peter (BOUGHTONP) 3 of 11
Hahahahaha, how have you ended up with that?

Probably could have typed it up by now.

From: Peter (BOUGHTONP) 6 Apr 2019 14:13
To: ANT_THOMAS 4 of 11
It's competition results from my climbing centre - usually they're also posted online, but the centre staff aren't given the ability to update the website, and it seems the person who does has forgotten or not bothered.

I could probably ask for an emailed copy, but I don't understand how detecting numerical digits can be difficult.

What's even more frustrating is that it's supposed to be possible to limit Tesseract to only detecting digits, but it doesn't work - it turns out they removed the ability to blacklist/whitelist characters in the current version. :@
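
For reference, the digit restriction is normally requested through the `tessedit_char_whitelist` config variable. The sketch below only builds the command line and does not run it. It assumes a `tesseract` binary on the PATH and, as far as I know, that the whitelist is still honoured by the legacy engine (`--oem 0`) in versions where the newer LSTM engine ignores it, which in turn needs legacy traineddata installed. Treat both as things to verify against your own install; `digits_only_command` and `results.png` are made-up names.

```python
import shlex

def digits_only_command(image_path: str, output_base: str = "stdout") -> list:
    """Build a tesseract command that asks for digits only.

    Assumption: the legacy engine (--oem 0) is available and honours the whitelist.
    """
    return [
        "tesseract", image_path, output_base,
        "--oem", "0",   # legacy engine (assumed to be installed)
        "--psm", "6",   # treat the image as a single uniform block of text
        "-c", "tessedit_char_whitelist=0123456789",
    ]

# Print the command rather than running it, so this works without Tesseract installed.
print(shlex.join(digits_only_command("results.png")))
```

To actually run it, pass the list to `subprocess.run(cmd, capture_output=True, text=True)` and read the text from `stdout`.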

From: ANT_THOMAS 6 Apr 2019 17:08
To: Peter (BOUGHTONP) 5 of 11
I hope you've OCRed the names and typed the numbers by now. But you're right, OCR software never seems accurate enough. Even trying to use it to read my electric meter was too much hassle for the time saving. And that was very regular characters. Probably could have put more effort in to find a way to give the OCR some history of characters to work with to make matching more accurate.

From: CHYRON (DSMITHHFX) 6 Apr 2019 17:51
To: Peter (BOUGHTONP) 6 of 11
Are you scanning it or taking a digital photo? Things like original print quality and typography will hugely affect accuracy. Also quality of the OCR software varies quite a bit.

From: Peter (BOUGHTONP) 6 Apr 2019 20:58
To: ANT_THOMAS 7 of 11
It's not urgent so I'm leaving it for now hoping someone will pop along with a miracle cure, or I'll just ask for the original file when I'm next in.

It is annoying it doesn't seem to have moved on in the past two decades - it should be possible to point OCR at anything, have it identify glyphs, then ask for feedback on which ones it got wrong, repeat until happy. Bleh.

From: Peter (BOUGHTONP) 6 Apr 2019 21:34
To: CHYRON (DSMITHHFX) 8 of 11
It's a photo taken with a digital compact, so there's a degree of noise and slight gradient, but there's no reason why it shouldn't be 99% OCR-able.

For example, attached is a crop of the row that gave "Q 124 on eel" - on its own it produces "124 97 2el", and in the first image (fixed horizontal/verticals, but gridlines still present and no brightness/contrast changes), it came closest with "0 124 on 221".

From: Peter (BOUGHTONP) 6 Apr 2019 22:03
To: ALL 9 of 11
I had the thought of forgetting about OCR and searching for what I actually want, i.e. "image to spreadsheet conversion", which came up with this: https://online2pdf.com/convert-jpg-to-excel

The formatting it produced was all over the place, but it did a good job on the numbers - a handful of mistakes, mostly with zeroes. There were a couple of incorrect numbers (161->151 and 77->17), which were highlighted by the totals not matching, but compared to Tesseract it was brilliant.
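
The totals check can be automated once the rows are in a list. This is a rough sketch that assumes each row is a name, some scores, and a printed total in the last column; the names and numbers are invented for illustration (only the 161->151 misread mirrors the post), and `mismatched_rows` is a hypothetical helper.

```python
def mismatched_rows(rows):
    """Return (name, computed_sum, printed_total) for rows whose scores don't add up."""
    bad = []
    for name, *scores, total in rows:
        computed = sum(scores)
        if computed != total:
            bad.append((name, computed, total))
    return bad

# Made-up data: the second row has 161 misread as 151, so its total no longer matches.
ocr_rows = [
    ("Climber A", 100, 77, 177),
    ("Climber B", 151, 77, 238),
]

for name, computed, total in mismatched_rows(ocr_rows):
    print(f"{name}: scores sum to {computed} but total reads {total}")
```

A mismatch says which row to re-check, not which cell is wrong, and it will miss the case where two errors cancel out or the total itself was misread the same way.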

Happy Peter -> :)

From: CHYRON (DSMITHHFX) 6 Apr 2019 22:23
To: Peter (BOUGHTONP) 10 of 11
I've had good luck with online OCR, though not tried for Excel.

From: Peter (BOUGHTONP) 6 Apr 2019 22:37
To: CHYRON (DSMITHHFX) 11 of 11
I'm guessing it's mostly just regular OCR, but uses tabs if there's more than a single space, although the file I got back did have merged cells with a dozen spaces for some of the rows, suggesting buggy overcomplicated logic.
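
The guessed "tabs where there's more than a single space" rule is easy to try on plain OCR text. A minimal sketch, assuming that rule is really all the converter does (which is only a guess); `split_cells` and the sample line are made up.

```python
import re

def split_cells(line: str) -> list:
    """Split one line of OCR text into cells on tabs or runs of two or more spaces."""
    return [cell.strip() for cell in re.split(r"\t| {2,}", line.strip())]

# Hypothetical OCR line: single spaces stay inside a cell, wider gaps separate cells.
line = "Jane Smith    0   124   97   221"
cells = split_cells(line)
print("\t".join(cells))  # tab-separated, ready to paste into a spreadsheet
```

The weak spot is obvious: if the OCR returns only a single space between two columns, they end up merged into one cell.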

We need to set Stallman on them all.