
Which measure indicates a smooth variation of data?

I am trying to compare text and non-text regions based on the thickness of their lines/strokes. Using the distance transform and some fiddling thereafter, I managed to obtain the thickness (actually half the thickness) of each stroke comprising the features in a picture.
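Roughly, that extraction step looks like this (a minimal sketch assuming OpenCV and scikit-image; the file name, the Otsu binarization, and sampling along the skeleton are placeholder choices, not necessarily exactly what I did):

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize

# Load a region and binarize it so strokes are white (foreground).
gray = cv2.imread("region.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Distance from every stroke pixel to the nearest background pixel.
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)

# Along the stroke skeleton, that distance is roughly half the stroke width.
ridge = skeletonize(binary > 0)
half_thickness = np.round(dist[ridge]).astype(int)
print(half_thickness)
```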

Here's a typical result of a program run:

1. Text region

34444433343554335533553555545544455445533444444344455435553335545556665444445654444444444444444444444444455434554554455444456544444445555445555543355556665544665444535444553354434553444444444444455444445544444454444444444444444444444444455442444444554444444544444444444444554444456444554414454444444444444444444554444445543454445443444544434443344443334442133223332221

2. Non-text

So is there any statistical measure, more sophisticated than the standard deviation, that will indicate the difference between the two datasets: one varies gradually while the second has drastic variations? (I included the scary numbers to illustrate what I'm attempting to quantify!)

Also, please note that the number of data points will not be the same, as I'll be comparing different regions against some experimentally determined threshold on the SD (or some other measure), not regions against each other.

If you are interested in measuring smoothness, the standard deviation of the differences between adjacent thicknesses should be much smaller for text than for non-text.

You can thus simply convert

34444433343554335533553555545544455445533444444344455435553335545556665444445654444444444444444444444444455434554554455444456544444445555445555543355556665544665444535444553354434553444444444444455444445544444454444444444444444444444444455442444444554444444544444444444444554444456444554414454444444444444444444554444445543454445443444544434443344443334442133223332221

into

1000(-1)000…

(1 = 4-3, 0 = 4-4, etc.). The standard deviation of this list of differences is small for text regions (in your example, the list contains many zeros).
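A minimal sketch of this statistic in Python (NumPy assumed; the sample data is a prefix of your example string):

```python
import numpy as np

def diff_std(thicknesses):
    """Standard deviation of the differences between adjacent values."""
    return np.diff(thicknesses).std()

text_region = [int(c) for c in "3444443334355433553355"]
print(diff_std(text_region))  # small for text, larger for non-text
```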

If you need to keep using numbers between 0 and 9 for the difference between thickness t1 and thickness t2, you can rescale: round((t2-t1+9)/2).
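For example (half-up rounding is spelled out here because Python's built-in round() rounds halves to even):

```python
def rescaled_diff(t1, t2):
    # t2 - t1 lies in [-9, 9], so (t2 - t1 + 9) / 2 lies in [0, 9].
    return int((t2 - t1 + 9) / 2 + 0.5)

print(rescaled_diff(0, 9))  # largest increase -> 9
print(rescaled_diff(9, 0))  # largest decrease -> 0
print(rescaled_diff(4, 4))  # no change        -> 5 (4.5 rounded half-up)
```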

The thought that comes to mind is that you could do a wavelet transform on a chunk and then look at the average energy in the high-frequency wavelets.

If you're not familiar with wavelets, the simplest one to describe is the Haar wavelet. Assuming that the number of points you have sampled is 2^n, you can calculate it as follows:

  1. Divide your data into pairs of points.
  2. Take 1/2 of the difference within each pair. Those are the detail-wavelet coefficients.
  3. Take the average of each pair. This gives you 2^(n-1) points. Recursively do a wavelet transform on those.

For each level of the Haar wavelet, take the average of the squared coefficients. If your data really looks like what you've described, this statistic will differ sharply between the two region types for the first few levels. Experiment, decide where your threshold is, and you'll probably have a pretty reliable test. (I would recommend having three possible answers from your test: "Text", "Not text", and "unclear". Look at the "unclear" examples and then improve your test.)
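A minimal sketch of the whole test in Python (it assumes the input length is a power of two, as above; the threshold itself is up to your experiments):

```python
import numpy as np

def haar_level_energies(data):
    """Average squared Haar detail coefficient at each level.

    energies[0] is the highest-frequency level; for text regions it
    should be much smaller than for non-text regions.
    """
    data = np.asarray(data, dtype=float)
    energies = []
    while len(data) > 1:
        pairs = data.reshape(-1, 2)
        detail = (pairs[:, 0] - pairs[:, 1]) / 2  # step 2: half the difference
        data = pairs.mean(axis=1)                 # step 3: averages, recursed on
        energies.append(np.mean(detail ** 2))
    return energies

text_region = [int(c) for c in "3444443334355433"]  # 16 = 2^4 points
print(haar_level_energies(text_region))
```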
