
Cluster / find similar heatmap figures using Python

I have the following sample heatmap images (I have hundreds of these images for now; the number will grow later):

[Images: heatmap1, heatmap2, heatmap3, heatmap4]

By eye, I'd say that heatmaps 1, 3 and 4 are similar to each other, or maybe 3 and 4 are the most similar; I'm not sure.

I'd like to be able to group the heatmap figures that are most similar to each other into different groups, based on the patterns and their intensities.

For example, each heatmap contains 24 rows and 5 columns (the rows represent time and the columns represent features). Each color in each column represents a number between 0 and 1. The pattern and intensities in column 1 of heatmaps 3 and 4 are more similar to each other than to those of the other heatmaps. But instead of looking at each column separately, I want to compare the overall patterns and intensities of each heatmap to one another.

I thought I would use k-means clustering, but couldn't find any information that helps me achieve what I want. My searches keep turning up hierarchical clustering, which, from what I understand, will not help me.

Then I found some information on image hashing. I read up on it a little, and it seems like it could help with my problem.

Before I read any further, I have a couple of questions I'd like to settle, so that I can invest my time in learning the better approach to this problem.

My questions/confusions:

  1. What is the best way to approach this problem? kmeans or image hashing?
  2. Is it even possible to do this using kmeans?

Any other approaches are welcome.

You can treat this as a clustering problem in which each datapoint has 24 x 5 = 120 dimensions (features). Make sure to flatten every datapoint in the same way (rows concatenated as row1 row2 row3 ..., or columns concatenated as col1 col2 col3 ...; just choose one and be consistent). You can then cluster these 120-dimensional vectors using K-means, any of the hierarchical clustering approaches, or any other clustering method (e.g. hashing can also be seen as a type of clustering, where similarity is determined by the hash function).
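As a rough illustration of the flattening step, here is a minimal sketch. It assumes NumPy and uses random arrays as a stand-in for the real heatmaps (the `heatmaps` array and the `pairwise_euclidean` helper are made up for this example). It shows that either flattening order gives the same pairwise distances, as long as the same order is used for every heatmap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 6 heatmaps, each 24 rows (time) x 5 columns
# (features), values in [0, 1). Replace with your real arrays.
heatmaps = rng.random((6, 24, 5))

# Row-wise flattening: row1 row2 row3 ... concatenated.
X_rows = heatmaps.reshape(len(heatmaps), -1)

# Column-wise flattening: col1 col2 col3 ... concatenated.
X_cols = heatmaps.transpose(0, 2, 1).reshape(len(heatmaps), -1)

print(X_rows.shape)  # (6, 120)


def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


D_rows = pairwise_euclidean(X_rows)
D_cols = pairwise_euclidean(X_cols)

# Both orders are just a fixed permutation of the same 120 values,
# so the distances between heatmaps are identical.
print(np.allclose(D_rows, D_cols))  # True

# Mixing the two orders compares unrelated cells and gives a different,
# meaningless distance.
mixed = np.linalg.norm(X_rows[0] - X_cols[1])
print(mixed, D_rows[0, 1])
```

The resulting `X_rows` (or `X_cols`) matrix, one row per heatmap, is the usual input format for clustering routines.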

For the similarity metric, you can try Euclidean distance or cosine similarity (or any other, e.g. symmetric KL divergence). Cosine similarity + K-means is known as spherical K-means and is very popular for document clustering (where each word in the document is treated as a feature).

For choosing the number of clusters (i.e. K in K-means, or the height at which to cut the dendrogram in hierarchical clustering), you can use the elbow method: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_elbow_method
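To make the difference between the two metrics concrete, here is a small sketch, again assuming NumPy and random stand-in data. It computes a cosine similarity matrix and checks the identity that links it to Euclidean distance once every vector is scaled to unit length, which is the reason L2-normalising the rows is a common first step for spherical K-means. It also shows that cosine similarity ignores overall intensity, whereas Euclidean distance does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical flattened heatmaps: 4 datapoints with 120 features each.
X = rng.random((4, 120))

# Scale every row to unit length. (An all-zero heatmap would need special
# handling here, because its norm is 0.)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity is the dot product of unit vectors.
cos_sim = X_unit @ X_unit.T

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b), so Euclidean distance
# on the normalised rows orders pairs exactly as cosine similarity does.
diff = X_unit[:, None, :] - X_unit[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)
print(np.allclose(sq_dist, 2 - 2 * cos_sim))  # True

# Cosine similarity ignores overall intensity; Euclidean distance does not.
a = X[0]
b = 0.5 * a  # same pattern at half the intensity
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euc_ab = np.linalg.norm(a - b)
print(cos_ab, euc_ab)
```

Since the question asks for similarity in both pattern and intensity, the scale invariance of cosine similarity may or may not be what you want.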
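Here is a minimal sketch of the elbow method using only NumPy. The `kmeans` function is a bare-bones Lloyd's algorithm written for this example, with a simple deterministic farthest-point initialisation (an assumption made to keep the example reproducible; it is not k-means++). The data is synthetic, built from three known groups. In practice a library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import numpy as np


def farthest_point_init(X, k):
    """Pick k starting centres: X[0], then repeatedly the point farthest
    from all centres chosen so far (illustrative, deterministic)."""
    idx = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d2.argmax())
        idx.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx].copy()


def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centres, inertia)."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Squared distance from every point to every centre, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    # Inertia: sum of squared distances to the assigned centre.
    inertia = float(d2[np.arange(len(X)), labels].sum())
    return labels, centers, inertia


# Synthetic stand-in data: 3 base patterns, 20 noisy copies of each,
# already flattened to 120 features and clipped to [0, 1].
rng = np.random.default_rng(0)
bases = rng.random((3, 120))
noise = rng.normal(0.0, 0.02, size=(60, 120))
X = np.clip(np.repeat(bases, 20, axis=0) + noise, 0.0, 1.0)

inertias = {}
for k in range(1, 7):
    _, _, inertias[k] = kmeans(X, k)
    print(k, round(inertias[k], 2))

labels3, _, _ = kmeans(X, 3)
```

On this synthetic data the inertia falls sharply up to K = 3 and only slightly after that; the bend in that curve is the "elbow". Real heatmaps rarely give such a clean bend, so treat the elbow as a guide rather than a rule.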

Hope that helps.

Before clustering any data, you should make clear to yourself what your similarity metric is. In other words, what makes two heat maps similar? You should also ask yourself what makes two heat maps very dissimilar, and what a cluster means in your case. After answering these questions, you can choose an appropriate metric and clustering method. (People often skip this process, either because they don't know enough clustering methods, or because they are lazy, or because they don't want to make any assumptions about the kind of results they will get. Instead they try a few clustering methods that happen to be implemented in their programming language, hoping that those methods will cover their needs.)

Here is a list of some questions you might want to ask yourself, before choosing a clustering method:

  • If heat map A is a rotation of heat map B, would you call them similar?
  • If heat map A is a reflection of heat map B, would you call them similar?
  • If heat map A is a shifted version (a translation) of heat map B, would you call them similar?
  • If heat map A is a negative of heat map B, would you call them similar?
  • Are two pixels with a value difference of 0.01 just as dissimilar as two pixels with a value difference 0.9?
  • If heat map A is identical to heat map B, besides one pixel which is very different, would you call them similar? or dissimilar?
  • If heat map A's pixel values are all exactly half of heat map B's pixel values, would you call them similar?
  • If heat map A is very similar to heat map B, and heat map B is very similar to heat map C, are A and C also similar?
  • Can a cluster contain two heat maps which are not very similar to each other, provided that there exists a third heat map that is similar enough to both?
  • Can a heat map belong to more than one cluster?
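To see how the choice of metric answers some of the questions above, here is a small sketch. It assumes NumPy, a random 24 x 5 heat map as stand-in data, and three candidate metrics chosen purely for illustration. It applies a few of the listed transformations to one heat map and prints how far each metric says the result is from the original.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((24, 5))  # hypothetical heat map: 24 time steps x 5 features

# A few of the transformations from the list above.
variants = {
    "identical": A.copy(),
    "shifted by 1 time step": np.roll(A, 1, axis=0),
    "reflected in time": A[::-1, :],
    "rotated by 180 degrees": A[::-1, ::-1],
    "negative (1 - A)": 1.0 - A,
    "half intensity": 0.5 * A,
}


def euclidean(a, b):
    return float(np.linalg.norm(a.ravel() - b.ravel()))


def cosine_distance(a, b):
    a, b = a.ravel(), b.ravel()
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def correlation_distance(a, b):
    # 1 - Pearson correlation: 0 for identical patterns, 2 for exact negatives.
    return float(1.0 - np.corrcoef(a.ravel(), b.ravel())[0, 1])


for name, B in variants.items():
    print(
        f"{name:24s}"
        f" euclidean={euclidean(A, B):6.3f}"
        f" cosine={cosine_distance(A, B):6.3f}"
        f" correlation={correlation_distance(A, B):6.3f}"
    )
```

For example, the half-intensity copy is at distance zero (up to rounding) under the cosine and correlation metrics but not under the Euclidean one, and the negative is as far away as possible under the correlation metric. If you would call those pairs similar or dissimilar, that already rules some metrics in or out.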

Answering these questions will, in turn, help you answer questions such as:

  • Should I use fuzzy or hard clustering?
  • What is the formula for my metric on the space of all heat maps?
  • Does my clustering method rely on the triangle inequality to work?
  • Should my clustering method allow extended continuous clusters (as viewed in feature space), where each member is similar only to its neighbors, or do all members of a cluster need to be similar to each other?
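The last two questions can be illustrated with a toy example. The sketch below assumes NumPy and three made-up flattened heat maps A, B and C, where B lies halfway between A and C. The `chain_clusters` helper is written for this example: it links any two items closer than a threshold and returns the connected components, which is the behaviour of a single-linkage style method.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Three hypothetical flattened heat maps; B is the midpoint of A and C.
A = rng.random(120)
C = rng.random(120)
B = 0.5 * (A + C)
X = np.stack([A, B, C])

# Pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# "Similar" means closer than this threshold: A-B and B-C qualify, A-C does not.
threshold = 0.75 * D[0, 2]


def chain_clusters(D, threshold):
    """Connected components of the graph that links items closer than
    `threshold` (a single-linkage style grouping, via union-find)."""
    parent = list(range(len(D)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(D)), 2):
        if D[i, j] < threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(D))]


labels = chain_clusters(D, threshold)
print(len(set(labels)))  # 1: A, B and C are chained into one cluster

# A stricter rule (every pair in a cluster must be similar) rejects that cluster.
all_pairs_close = bool((D < threshold).all())
print(all_pairs_close)  # False: A and C are not similar to each other

# Not every dissimilarity is a metric: squared Euclidean distance breaks
# the triangle inequality on exactly this example.
sq = D ** 2
print(sq[0, 2] <= sq[0, 1] + sq[1, 2])  # False
```

Whether the chained cluster is acceptable depends on how you answered the earlier questions about A, B and C.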

(The choice of clustering method can also depend on its complexity, on its performance on large amounts of data, on whether it is parallelizable, on whether it can give you hierarchical clusters, on whether its results allow easy classification of new heat maps, and more.)
