I have some black-and-white documents (scanned images) and want to cluster them according to their layout. To make this more concrete, say I have the following three images; the first two would more likely fall into the same cluster than the third, because the first two have relatively similar layouts.
My question is, what would be the best approach to clustering the documents? Right now I have a couple of initial approaches:
Are there other, better approaches? Again, only the layout matters.
Don't attempt to cluster raw data.
Clustering is unsupervised: it can't learn which properties are important and which are not. To a clustering algorithm, everything is important.
Instead, define layout-relevant features first, such as long edges.
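To illustrate what "define features first" might look like, here is a minimal sketch in plain Python. It does not implement the long-edge idea; it swaps in a simpler layout feature, the fraction of ink in a few horizontal and vertical bands of the page (a coarse projection profile). The function names, the 0/1 list-of-rows image format, the `bins` parameter, and the synthetic pages are all assumptions made for the example, not something from the question.

```python
from math import dist


def layout_features(img, bins=4):
    """Ink fraction per horizontal band, then per vertical band.

    img is a list of equal-length rows of 0/1 values (1 = ink).
    Height and width are assumed to be at least `bins`.
    """
    h, w = len(img), len(img[0])
    feats = []
    # Row bands: roughly where blocks sit vertically on the page.
    for b in range(bins):
        r0, r1 = b * h // bins, (b + 1) * h // bins
        ink = sum(sum(row) for row in img[r0:r1])
        feats.append(ink / ((r1 - r0) * w))
    # Column bands: roughly where margins and columns sit.
    for b in range(bins):
        c0, c1 = b * w // bins, (b + 1) * w // bins
        ink = sum(sum(row[c0:c1]) for row in img)
        feats.append(ink / (h * (c1 - c0)))
    return feats


def make_page(h, w, boxes):
    """Hypothetical helper: blank page with filled (r0, r1, c0, c1) boxes."""
    img = [[0] * w for _ in range(h)]
    for r0, r1, c0, c1 in boxes:
        for r in range(r0, r1):
            for c in range(c0, c1):
                img[r][c] = 1
    return img


# Two single-column pages with a header block, and one two-column page.
page_a = make_page(40, 32, [(2, 6, 4, 28), (10, 38, 4, 28)])
page_b = make_page(40, 32, [(3, 7, 4, 28), (11, 37, 5, 27)])
page_c = make_page(40, 32, [(2, 38, 2, 14), (2, 38, 18, 30)])

fa, fb, fc = (layout_features(p) for p in (page_a, page_b, page_c))

# Distances in feature space; a and b should be closer than a and c.
print(dist(fa, fb), dist(fa, fc))
```

The resulting fixed-length vectors, rather than the raw pixels, are what you would hand to a clustering algorithm. Real scans would probably need deskewing and size normalization first, and features such as detected long lines may separate layouts better than these coarse bands.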