I have the following sample images of heatmaps (I have hundreds of these images for now; the number will grow later):
Using my human eye, I'd say that heatmaps 1, 3 and 4 are similar to each other, or maybe 3 and 4 are the most similar to each other; I'm not sure.
I'd like to be able to group the heatmap figures that are most similar to each other into different groups, based on the patterns and their intensities.
For example, each heatmap contains 24 rows and 5 columns (the rows represent time and the columns represent features). Each color in each column represents a number between 0 and 1. The pattern and intensities in column 1 of heatmaps 3 and 4 are more similar to each other than to those of the other heatmaps. But instead of looking at each column separately, I want to compare the overall patterns and intensities of each heatmap to one another.
I thought I was going to use k-means clustering, but couldn't find any info that could help me achieve what I want. My searches mostly turn up hierarchical clustering, which, from what I understand, will not help me.
Then, I found some information on image hashing. Read up on it a little bit, and it seems like it could help me with my problem.
Before I read and learn any further, I have a couple of questions/confusions that I'd like to address, so that I can invest my time in learning about the better way to approach this problem.
My questions/confusions:
Any other approaches are welcome.
You can look at this problem as a clustering problem for datapoints that each have 24 x 5 = 120 dimensions (or features). Make sure to flatten each datapoint in the same fashion (row1row2row3... concatenated or col1col2col3... concatenated; just choose one and be consistent). You can take these 120 features for each datapoint and cluster them using K-means, any one of the hierarchical clustering family of approaches, or any other clustering approach (e.g. hashing can also be a type of clustering, where the similarity is determined by the hash function).
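As a rough illustration of the flattening step, here is a minimal NumPy sketch. The data is made up (six random 24 x 5 arrays standing in for real heatmaps, with one deliberately made a near copy of another), and Euclidean distance is just one possible choice of metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 6 heatmaps, each 24 rows (time) x 5 columns
# (features), with values in [0, 1]. Replace with the real arrays.
heatmaps = rng.random((6, 24, 5))
# Make heatmap 4 a slightly noisy copy of heatmap 3 so one pair is clearly similar.
heatmaps[4] = np.clip(heatmaps[3] + rng.normal(0, 0.01, (24, 5)), 0, 1)

# Row-major flattening (row1 row2 row3 ...), applied identically to every heatmap.
X = heatmaps.reshape(len(heatmaps), -1)  # shape (6, 120)

# Pairwise Euclidean distances between the flattened heatmaps.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Most similar pair, ignoring the zero diagonal.
masked = dist + np.diag(np.full(len(X), np.inf))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print("flattened shape:", X.shape)
print("most similar pair:", (int(i), int(j)))
```

The matrix `X` (one row per heatmap) is the form that K-means and most other clustering routines expect as input.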
For the similarity metric, you can try Euclidean distance or cosine similarity (or any other, e.g. symmetric KL divergence). Cosine similarity + K-means becomes spherical K-means, which is very popular for document clustering (where each word in the doc is treated as a feature).
For choosing the number of clusters (i.e. K in K-means, or the height at which to cut the dendrogram in hierarchical clustering), you can use the elbow method https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_elbow_method
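To make the difference between the two metrics concrete, here is a small sketch with made-up data (the helper name `l2_normalize` is just for illustration). After scaling each flattened heatmap to unit length, squared Euclidean distance equals 2 - 2 * cosine similarity, which is why ordinary K-means on normalized vectors behaves like a cosine-based clustering.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row to unit Euclidean length (eps guards all-zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

rng = np.random.default_rng(1)
# Hypothetical flattened heatmaps: 4 datapoints with 120 features each.
X = rng.random((4, 120))
X[1] = 0.5 * X[0]  # same pattern as heatmap 0, but at half the intensity

U = l2_normalize(X)
cos_sim = U @ U.T  # cosine similarity between every pair
sq_euclid = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)

# On unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(np.allclose(sq_euclid, 2 - 2 * cos_sim))
# Cosine sees heatmaps 0 and 1 as identical; raw Euclidean distance does not.
print(cos_sim[0, 1], np.linalg.norm(X[0] - X[1]))
```

Note that cosine similarity ignores overall scale: a heatmap and a uniformly dimmer copy of it count as identical. Since the question says intensities matter, Euclidean distance on the raw values may be the better fit here.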
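A minimal sketch of the elbow method, under several assumptions: the data is synthetic (three made-up base patterns with small noise), the K-means is a bare-bones Lloyd's algorithm with a deterministic farthest-first initialization rather than a library implementation, and picking the elbow by the largest second difference is only one heuristic. In practice the elbow is usually read off a plot.

```python
import numpy as np

def farthest_first_init(X, k):
    """Start from X[0], then repeatedly add the point farthest from all chosen centroids."""
    centroids = [X[0]]
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = X[d2.argmax()]
        centroids.append(nxt)
        d2 = np.minimum(d2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm; returns (labels, inertia)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster becomes empty
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()  # within-cluster sum of squares
    return labels, inertia

rng = np.random.default_rng(0)
# Hypothetical data: 3 base patterns, 20 noisy 24 x 5 heatmaps per pattern.
patterns = rng.random((3, 24, 5))
heatmaps = np.concatenate(
    [np.clip(p + rng.normal(0, 0.02, (20, 24, 5)), 0, 1) for p in patterns]
)
X = heatmaps.reshape(len(heatmaps), -1)  # shape (60, 120)

ks = list(range(1, 7))
inertias = [kmeans(X, k)[1] for k in ks]
for k, inertia in zip(ks, inertias):
    print(k, round(float(inertia), 2))

# Heuristic elbow: the K where the curve bends most (largest second difference).
second_diff = np.diff(inertias, n=2)  # second_diff[i] is centered on ks[i + 1]
elbow_k = ks[int(second_diff.argmax()) + 1]
print("heuristic elbow at K =", elbow_k)
```

On real heatmaps the curve is rarely this clean, so it is worth plotting the inertia against K and checking the suggested value by eye.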
Hope that helps.
Before performing clustering on any data, you should make clear to yourself what your similarity metric is. In other words, what makes two heat maps similar? You should also ask yourself what makes heat maps very dissimilar. You may also want to clarify to yourself what a cluster means in your case. After answering these questions, you can choose the appropriate metric and clustering method. (People don't usually go through this process, either because they don't know enough clustering methods, or because they are lazy. Or sometimes they just don't want to make any assumptions about what kind of results they will get. What they do then is try a few clustering methods that have implementations in their programming language, hoping that those methods will cover their needs.)
Here is a list of some questions you might want to ask yourself, before choosing a clustering method:
Answering these questions will help you answer, for example, the following:
(Choosing a clustering method can also depend on its complexity, on its performance on large amounts of data, on whether it is parallelizable, on whether it can give you hierarchical clusters, on whether its results allow an easy classification of new heat maps, and more.)