Train Tesseract 3 to get a table of letters

Updated: 2024-10-23 06:32:01

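I've been trying to use plain Tesseract 3 OCR, with various options, to read the data from a table of letters on which my students marked one letter per row as their answer to multiple-choice questions, as seen below:

One of the best outputs was: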

EEEEEEEEEEEEEEEEEEEEEEEEE
DDDDDDDDDDDDDDDDDDDDDDDDD
CCCCCCCCCCCCCCCCCCCCCCCCC
BBBBBBBEBBBBBBBBBBBBBBBBB
AAAAAAAAAAAAAAAAAAAAAAAAA
6789012345678901234567890
2222333333333344444444445
EEEEE EEEE EE EEE EEEEEEE
DDDDDD DDD DDDDDDDDDDDD
CCCCCCCCCCCCCCCCCC CCCCC
B BEBE BB BBBBBBBBBBBBBBB
AA AAA AAAAA AAAAAAAA
1234567890123455789012345
OOOOOOOOO1111111111222222

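I know I can parse that .txt file and get a better result, but the OCR missed a lot of information and also picked up letters from some of the painted blocks.

I'd like to know what I can do to get better results in this case.

I would also like to get a table in which the painted blocks appear as a different character. For example, for the first and second lines of the image: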

01 A B C - E   26 A B C D E
02 A - C D E   27 A B C D E

If anyone has had a similar experience, any information would be appreciated. Thanks in advance!

Best answer

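First, I suggest preprocessing the image, for example by making the dark parts darker and blurring it slightly. Experiment until Tesseract stops seeing letters in the filled-in squares.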
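To illustrate what "making the dark parts darker" can mean in practice, here is a minimal pure-Python sketch of a percentile-based contrast stretch. It only approximates the idea behind options such as ImageMagick's -contrast-stretch and is not that tool's exact algorithm. The function name, the flat list of grayscale values as input, and the default percentages are assumptions made for illustration.

```python
def contrast_stretch(pixels, black_pct=7.0, white_pct=10.0):
    """Clip the darkest and brightest pixels, then rescale the rest to 0..255.

    pixels: flat list of grayscale values in 0..255 (hypothetical input format).
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    n = len(ordered)
    # Value below which roughly black_pct percent of the pixels fall.
    low = ordered[min(n - 1, int(n * black_pct / 100.0))]
    # Value above which roughly white_pct percent of the pixels fall.
    high = ordered[max(0, n - 1 - int(n * white_pct / 100.0))]
    if high <= low:
        # Degenerate histogram: fall back to a plain threshold.
        return [0 if p <= low else 255 for p in pixels]
    scale = 255.0 / (high - low)
    # Everything at or below `low` becomes black, at or above `high` white.
    return [max(0, min(255, round((p - low) * scale))) for p in pixels]


if __name__ == "__main__":
    # A toy "image": mostly mid-grey with a few dark and bright pixels.
    print(contrast_stretch([30, 40, 90, 120, 130, 140, 200, 220]))
```

Filled-in squares end up closer to solid black after such a stretch, which should make it less likely that faint letter outlines inside them are recognised.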

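Second, you have two options:

First, you can enable hOCR output and parse the layout of the scanned letters yourself. hOCR is an HTML-based format that includes the bounding-box coordinates of every recognized word. Try to work out where the rows and columns are.

Alternatively, try to get Tesseract to recognise the layout properly, rather than as text rotated by 90°.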
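For the hOCR option, the following is a rough sketch of how the word boxes might be pulled out and grouped into rows using only the Python standard library. It assumes each word is a span whose class contains "word" and whose title (or an enclosing word span's title) contains "bbox x0 y0 x1 y1". The sample markup, the class names HocrWords and group_rows, and the 10-pixel tolerance are illustrative assumptions, so real Tesseract output should be checked against them.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every word span that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []  # one entry per open <span>: bbox tuple or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        bbox = None
        if "word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                bbox = tuple(int(v) for v in match.groups())
        self._stack.append(bbox)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing word span with a bbox.
        for bbox in reversed(self._stack):
            if bbox is not None:
                self.words.append((text,) + bbox)
                break


def group_rows(words, tolerance=10):
    """Group words with similar vertical centres, then sort each row by x0."""
    rows = []
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        centre = (word[2] + word[4]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    return [[w[0] for w in sorted(r["words"], key=lambda w: w[1])] for r in rows]


# Hand-written sample in the spirit of hOCR output (not real Tesseract output).
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 400 40'>
 <span class='ocrx_word' title='bbox 10 12 40 38'>01</span>
 <span class='ocrx_word' title='bbox 60 11 80 39'>A</span>
 <span class='ocrx_word' title='bbox 100 12 120 38'>B</span>
</span>
<span class='ocr_line' title='bbox 10 60 400 90'>
 <span class='ocrx_word' title='bbox 60 61 80 89'>A</span>
 <span class='ocrx_word' title='bbox 10 62 40 88'>02</span>
</span>
"""

parser = HocrWords()
parser.feed(SAMPLE)
parser.close()
rows = group_rows(parser.words)
print(rows)
```

With the x coordinates available, a gap between two neighbouring letters in a row that is wider than the usual column spacing would point to a filled-in square at that position.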

Anyway, this is what I did:

1. I ran the image through ImageMagick:

$ convert CDZjN.png -deskew 40% -contrast-stretch 7%x10% -filter lanczos -resize 250% ooo.png

2. I created a config file for Tesseract, t.conf, that disables vertical text detection and the English dictionaries:

textord_tabfind_vertical_text 0
load_system_dawg 0
load_freq_dawg 0
load_punc_dawg 0
load_number_dawg 0
load_unambig_dawg 0
load_bigram_dawg 0
load_fixed_length_dawgs 0

3. I simply ran it:

$ tesseract ooo.png ooo t.conf ; cat ooo.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
01ABC-E 26ABCDE
02A CDE 27ABCDE
o3 BCDE 28ABCDE
o4 BCDE 29ABCDE
o5 BCDE 30ABCDE
06ABCD. 31ABCDE
07A-CDE 32ABCDE
08ABC.E 33ABCDE
o9 BCDE 34ABCDE
10A CDE 35ABCDE
11ABCD 36ABCDE
12ABC E 37ABCDE
13ABC E 38ABCDE
14ABCD 39ABCDE
15 BCDE 40ABCDE
1s BCDE 41ABCDE
17 BCDE 42ABCDE
18ABCD_ 43ABCDE
19AB DE 44ABCDE
20AB DE 45ABCDE
21ABCDE 46ABCDE
22ABCDE 47ABCDE
23ABCDE 48ABCDE
24ABCDE 49ABCDE
25ABCDE 50ABCDE

Not perfect, but passable.
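Output of this shape can be turned into the table asked for in the question with a small post-processing step. The sketch below is only one possible approach and rests on assumptions taken from the output above: every text line holds two cells, the first two characters of a cell are the question number (sometimes misread, as in "o3" or "1s"), and a letter missing from A to E means that square was filled in. The function names and the 25-rows-per-column default are illustrative.

```python
LETTERS = "ABCDE"


def render_column(cell, number):
    """Turn one OCR cell such as '02A CDE' into '02 A - C D E'.

    The first two characters are the (possibly misread) question number, so
    they are skipped and `number` is taken from the row position instead.
    """
    seen = set(cell[2:]) & set(LETTERS)
    marks = " ".join(ch if ch in seen else "-" for ch in LETTERS)
    return f"{number:02d} {marks}"


def render_table(ocr_text, rows_per_column=25):
    """Render every non-empty OCR line as 'NN A B C D E   MM A B C D E'."""
    out = []
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines, start=1):
        # The right-hand cell is the last whitespace-separated token.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected two columns in line {i}: {line!r}")
        left, right = parts
        out.append(
            render_column(left, i) + "   " + render_column(right, i + rows_per_column)
        )
    return out


if __name__ == "__main__":
    sample = "01ABC-E 26ABCDE\n02A CDE 27ABCDE\no3 BCDE 28ABCDE\n"
    for row in render_table(sample):
        print(row)
```

Taking the question number from the row position avoids depending on misread digits, but it also means a dropped or extra line in the OCR text would shift every number after it, so the line count is worth checking first.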

Published: 2023-08-05 05:03:00

Article link: https://www.elefans.com/category/jswz/34/1427570.html