How to detect and remove guide lines from a scanned image/document efficiently?

Question

For my project I am writing an image preprocessing library for scanned documents. Right now I am stuck on the line removal feature.

Problem Description: A sample scanned form:

Name*  : ______________________________
Age* : ______________________________

Email-ID: |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|

Following are the further conditions:

The scanned document may contain many more vertical and horizontal guiding lines.
Thickness of the lines may exceed 1px
The document itself is not printed properly and might have noise in the form of ink bloating or uneven thickness
The document might have colored background or lines

What I am trying to do is detect these lines and remove them without losing the handwritten content.

Solution so far: The current solution is implemented in Java.

I detect these lines using a combination of Canny/Sobel edge detectors and a threshold filter (to make the image bitonal). This gives me a black-and-white array of pixels. I traverse the array and check whether the luminosity of each pixel falls below a specified bin value. If I find 30 such pixels in a row (the minimum line length in pixels), I remove them. I repeat the same for vertical lines, taking into account that there will be cuts due to the horizontal line removal.
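The run-length scan described above could be sketched roughly as follows. This is only an illustration, not the actual library code: the class name RunLengthLineRemover, the method removeLongRuns, and the boolean[][] representation (true = dark pixel, rectangular image) are all assumptions. One deliberate difference from the description: both passes detect runs on the original input and clear them in a copy, so the vertical pass never sees the cuts left by horizontal removal.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any existing library.
final class RunLengthLineRemover {
    private RunLengthLineRemover() {}

    /**
     * Returns a copy of ink (ink[y][x] == true means a dark pixel) in which
     * every horizontal or vertical run of at least minRun dark pixels is cleared.
     * Assumes a rectangular image; the input array is not modified.
     */
    static boolean[][] removeLongRuns(boolean[][] ink, int minRun) {
        int height = ink.length;
        int width = height == 0 ? 0 : ink[0].length;
        boolean[][] out = new boolean[height][];
        for (int y = 0; y < height; y++) {
            out[y] = ink[y].clone();
        }

        // Horizontal pass: find runs in the input, clear them in the copy.
        for (int y = 0; y < height; y++) {
            int x = 0;
            while (x < width) {
                if (!ink[y][x]) { x++; continue; }
                int start = x;
                while (x < width && ink[y][x]) x++;
                if (x - start >= minRun) {
                    Arrays.fill(out[y], start, x, false);
                }
            }
        }

        // Vertical pass: also reads the input, so horizontal removal causes no cuts.
        for (int x = 0; x < width; x++) {
            int y = 0;
            while (y < height) {
                if (!ink[y][x]) { y++; continue; }
                int start = y;
                while (y < height && ink[y][x]) y++;
                if (y - start >= minRun) {
                    for (int k = start; k < y; k++) out[k][x] = false;
                }
            }
        }
        return out;
    }
}
```

Calling RunLengthLineRemover.removeLongRuns(bitonal, 30) would mirror the 30-pixel minimum mentioned above. Each pixel is visited a constant number of times per pass, but the sketch still treats any sufficiently long run as a line, so tightly spaced or overlapping characters can be removed too.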

Although the solution seems to work, there are problems:

Overlapping characters get removed.
If characters in the image are not properly spaced, they are also treated as a line.
The output image from edge detection is in black and white.
It is a bit slow: it normally takes around 40 seconds for an image of 2480*3508.

Kindly guide me on how to do this properly and efficiently. And if there is an open-source library, please point me to it.

Thanks

Answer 1

First, I want to mention that I know nothing about image processing in general, and about OCR in particular.

Still, a very simple heuristic comes to my mind:

Separate the pixels in the image into connected components.
For each connected component, decide whether it is a line or not using one or more of the following heuristics:
1. Is it longer than the average letter length?
2. Does it appear near other letters? (To remove ink bloats or artifacts.)
3. Are its X gradient and Y gradient both large enough? This could make sure that the connected component contains more than just a horizontal line.
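
A minimal Java sketch of the connected-component idea, covering only the first heuristic (length versus thickness of the bounding box). The names ComponentLineFilter, Box, findComponents and looksLikeLine, as well as the minLength and maxThickness thresholds, are made up for illustration; the image is assumed to be a rectangular bitonal boolean[][] where true marks ink, and pixels are grouped with 8-connectivity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any existing library.
final class ComponentLineFilter {
    private ComponentLineFilter() {}

    /** Bounding box of one connected component (inclusive coordinates). */
    static final class Box {
        int minX, minY, maxX, maxY;
        Box(int x, int y) { minX = maxX = x; minY = maxY = y; }
        void add(int x, int y) {
            minX = Math.min(minX, x); maxX = Math.max(maxX, x);
            minY = Math.min(minY, y); maxY = Math.max(maxY, y);
        }
        int width() { return maxX - minX + 1; }
        int height() { return maxY - minY + 1; }
    }

    /** Labels 8-connected ink components with a breadth-first flood fill. */
    static List<Box> findComponents(boolean[][] ink) {
        int h = ink.length;
        int w = h == 0 ? 0 : ink[0].length;
        boolean[][] seen = new boolean[h][w];
        List<Box> boxes = new ArrayList<>();
        ArrayDeque<int[]> queue = new ArrayDeque<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!ink[y][x] || seen[y][x]) continue;
                Box box = new Box(x, y);
                seen[y][x] = true;
                queue.add(new int[] {x, y});
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    box.add(p[0], p[1]);
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int nx = p[0] + dx, ny = p[1] + dy;
                            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                            if (!ink[ny][nx] || seen[ny][nx]) continue;
                            seen[ny][nx] = true;
                            queue.add(new int[] {nx, ny});
                        }
                    }
                }
                boxes.add(box);
            }
        }
        return boxes;
    }

    /** Heuristic 1 only: long in one direction and thin in the other. */
    static boolean looksLikeLine(Box b, int minLength, int maxThickness) {
        boolean horizontal = b.width() >= minLength && b.height() <= maxThickness;
        boolean vertical = b.height() >= minLength && b.width() <= maxThickness;
        return horizontal || vertical;
    }
}
```

Note that a letter touching a guide line merges with it into a single component whose bounding box is no longer thin, so this test leaves that line in place.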

The only problem I can see is, if somebody writes letters on a horizontal line, like so:

   /\     ___
  /  \   /   \
  |__|   |___/
 -|--|---|---|------------------
  |  |    \__/

In that case the line would remain, but you have to handle this case anyhow.

As I mentioned, I'm by no means an image processing expert, but sometimes very simple tricks work.

solution1
1 ACCPTED 2010-06-29 13:50:34
