Cluster / find similar heatmap figures using Python

I have the following sample images of heatmaps (I have hundreds of these images for now, and the number will grow later):

heatmap1

heatmap2

heatmap3

heatmap4

Judging by eye, I'd say that heatmap1, 3 and 4 are similar to each other, or maybe 3 and 4 are the most similar pair; I'm not sure.

I'd like to be able to group the heatmap figures that are most similar to each other into different groups, based on their patterns and intensities.

For example, each heatmap contains 24 rows and 5 columns (the rows represent time and the columns represent features). Each color in each column represents a number between 0 and 1. The pattern and intensities in column 1 of heatmaps 3 and 4 are more similar to each other than to those of the other heatmaps. But instead of looking at each column separately, I want to compare the overall patterns and intensities of each heatmap to one another.

I thought I was going to use kmeans clustering, but couldn't find any info that would help me achieve what I want. My searches mostly turn up hierarchical clustering, which, from what I understand, will not help me.

Then I found some information on image hashing. I read up on it a little, and it seems like it could help with my problem.

Before I read and learn any further, I have a couple of questions/confusions that I'd like to address, so that I can invest my time in learning about the better way to approach this problem.

My questions/confusions:

  1. What is the best way to approach this problem: kmeans or image hashing?
  2. Is it even possible to do this using kmeans?

Any other approaches are welcome.

You can look at this problem as a clustering problem for datapoints that each have 24 x 5 = 120 dimensions (or features). Make sure to flatten each datapoint in the same fashion (row1row2row3... concatenated or col1col2col3... concatenated; just choose one and be consistent). You can take these 120 features for each datapoint and cluster them using K-means, any of the hierarchical clustering family of approaches, or any other clustering approach (e.g. hashing can also be a type of clustering where the similarity is determined by the hash function).
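As a rough illustration of the flatten-then-cluster idea, here is a minimal NumPy-only sketch. The random heat maps are stand-in data, and `flatten_heatmaps`, `init_centers` and `kmeans` are hypothetical helper names; the farthest-first initialization is just a simple deterministic choice for the example. In practice you would more likely reach for a library implementation such as scikit-learn's `KMeans`.

```python
import numpy as np

def flatten_heatmaps(heatmaps):
    """Reshape an (n, 24, 5) stack into (n, 120), row-major for every map."""
    heatmaps = np.asarray(heatmaps, dtype=float)
    return heatmaps.reshape(len(heatmaps), -1)

def init_centers(X, k):
    """Deterministic farthest-first start (an illustrative choice)."""
    centers = [X[0]]
    while len(centers) < k:
        # squared distance from every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm with Euclidean distance."""
    centers = init_centers(X, k)
    for _ in range(n_iter):
        # squared distances, shape (n_points, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center; keep the old one if a cluster went empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Stand-in data (assumption): 10 "low intensity" and 10 "high intensity" maps.
rng = np.random.default_rng(42)
low = rng.uniform(0.0, 0.3, size=(10, 24, 5))
high = rng.uniform(0.7, 1.0, size=(10, 24, 5))
heatmaps = np.concatenate([low, high])

X = flatten_heatmaps(heatmaps)   # shape (20, 120)
labels, centers = kmeans(X, k=2)
print(X.shape)
print(labels)
```

Each row of `X` is one heat map, so `labels[i]` is the cluster assigned to heat map `i`. Reshaping a center back with `centers[j].reshape(24, 5)` gives an "average heat map" for cluster `j` that you can plot.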

For the similarity metric you can try Euclidean distance or cosine similarity (or any other, e.g. symmetric KL divergence). Cosine similarity + K-means becomes spherical K-means, which is very popular for document clustering (where each word in the document is treated as a feature).
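The metrics mentioned above can be sketched as follows. All function names here are made up for illustration. For the symmetric KL divergence, each flattened heat map is normalized to sum to 1 so that it can be treated as a distribution; that normalization is an assumption of this sketch, not something the data requires.

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def symmetric_kl(a, b, eps=1e-12):
    # Assumption: normalize each vector to sum to 1 so it acts like a distribution.
    p = (a + eps) / (a + eps).sum()
    q = (b + eps) / (b + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def to_unit_length(X):
    """Scale every row to length 1 (zero rows are left untouched)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 1.0, size=120)   # a flattened 24 x 5 heat map
b = 0.5 * a                           # same pattern, half the intensity
c = rng.uniform(0.1, 1.0, size=120)   # an unrelated heat map

print("euclidean(a, b):", euclidean_distance(a, b))
print("cosine(a, b):   ", cosine_similarity(a, b))
print("sym. KL(a, b):  ", symmetric_kl(a, b))
print("cosine(a, c):   ", cosine_similarity(a, c))

# On unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine.
u, v = to_unit_length(np.vstack([a, c]))
print(np.sum((u - v) ** 2), 2 - 2 * cosine_similarity(a, c))
```

The last line shows why cosine similarity and K-means fit together: after scaling every heat map to unit length, ordinary Euclidean K-means ranks neighbors the same way cosine similarity does. Spherical K-means additionally rescales the centroids to unit length after each update. Note that cosine similarity and this KL variant both ignore overall intensity (`b` looks identical to `a`), which may or may not be what you want.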

For choosing the number of clusters (i.e. K in K-means, or the height at which to cut the dendrogram in hierarchical clustering), you can use the elbow method: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_elbow_method

Hope that helps.
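A minimal sketch of the elbow method, assuming synthetic data with three intensity levels and a small NumPy-only K-means (`kmeans_inertia` is a hypothetical helper, and its farthest-first start is only there to make the example deterministic). The idea is to compute the within-cluster sum of squared distances for several values of K and look for the point where adding clusters stops paying off.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # deterministic farthest-first initialization (illustrative choice)
    centers = [X[0]]
    while len(centers) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # squared distance of every point to its closest final center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Stand-in data (assumption): three groups of flattened 24 x 5 heat maps
# centered on intensities 0.1, 0.5 and 0.9.
rng = np.random.default_rng(1)
X = np.concatenate([
    level + rng.uniform(-0.05, 0.05, size=(10, 120))
    for level in (0.1, 0.5, 0.9)
])

ks = list(range(1, 7))
inertias = [kmeans_inertia(X, k) for k in ks]
for k, inertia in zip(ks, inertias):
    print(f"K={k}: inertia={inertia:.2f}")
```

With data like this, the inertia falls sharply up to K=3 and only marginally afterwards, so the "elbow" sits at K=3. On real heat maps the bend is often less clear-cut, so treat it as a guide rather than a rule.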

Before performing clustering on any data, you should be clear about what your similarity metric is. In other words, what makes two heat maps similar? You should also ask yourself what makes two heat maps very dissimilar, and what a cluster means in your case. After answering these questions, you can choose the appropriate metric and clustering method. (People don't usually go through this process, either because they don't know enough clustering methods or because they are lazy. Sometimes they just don't want to make any assumptions about what kind of results they will get. What they do instead is try a few clustering methods that have implementations in their programming language, hoping that those methods will cover their needs.)

Here is a list of some questions you might want to ask yourself before choosing a clustering method:

  • If heat map A is a rotation of heat map B, would you call them similar?
  • If heat map A is a reflection of heat map B, would you call them similar?
  • If heat map A is a shifted version (a translation) of heat map B, would you call them similar?
  • If heat map A is a negative of heat map B, would you call them similar?
  • Are two pixels with a value difference of 0.01 just as dissimilar as two pixels with a value difference of 0.9?
  • If heat map A is identical to heat map B except for one pixel, which is very different, would you call them similar or dissimilar?
  • If heat map A's pixel values are all exactly half of heat map B's pixel values, would you call them similar?
  • If heat map A is very similar to heat map B, and heat map B is very similar to heat map C, are A and C also similar?
  • Can a cluster contain two heat maps which are not very similar to each other, provided that there exists a third heat map that is similar enough to both?
  • Can a heat map belong to more than one cluster?
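To make a few of these questions concrete, here is a small sketch that applies some of the transformations to a random stand-in heat map and scores each variant with several candidate measures. The measure names and the choice of variants are illustrative assumptions; the point is only that different metrics give different answers to the same question.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    a, b = a.ravel(), b.ravel()
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_abs_diff(a, b):
    return float(np.abs(a - b).mean())

def max_abs_diff(a, b):
    return float(np.abs(a - b).max())

rng = np.random.default_rng(0)
A = rng.uniform(0.2, 1.0, size=(24, 5))   # stand-in heat map (assumption)

# One pixel flipped to the far end of the 0..1 range.
outlier = A.copy()
outlier[0, 0] = 0.0 if A[0, 0] > 0.5 else 1.0

variants = {
    "half intensity": 0.5 * A,
    "negative (1 - A)": 1.0 - A,
    "shifted one time step": np.roll(A, 1, axis=0),
    "columns reflected": A[:, ::-1],
    "one outlier pixel": outlier,
}

print(f"{'variant':<24}{'euclid':>9}{'cosine':>9}{'mean|d|':>9}{'max|d|':>9}")
for name, B in variants.items():
    print(f"{name:<24}{euclidean(A, B):>9.3f}{cosine_distance(A, B):>9.3f}"
          f"{mean_abs_diff(A, B):>9.3f}{max_abs_diff(A, B):>9.3f}")
```

For example, cosine distance treats the half-intensity map as identical to the original while Euclidean distance does not, and the single outlier pixel barely moves the mean absolute difference but dominates the maximum absolute difference. None of these measures treats a shifted or reflected map as similar; if you want that, the invariance has to be built into the metric or the preprocessing.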

Answering these questions will help you answer, for example, the following:

  • Should I use fuzzy or hard clustering?
  • What is the formula for my metric on the space of all heat maps?
  • Does my clustering method rely on the triangle inequality to work?
  • Should my clustering method allow extended continuous clusters (as viewed in feature space), where each member is similar only to its neighbors, or do all members of a cluster need to be similar to each other?
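The last question can be illustrated with a toy chain of heat maps whose intensity rises in small steps. In the sketch below, `chain_clusters` groups maps that are connected through a chain of near neighbors (which is what cutting a single-linkage dendrogram at a distance threshold gives you), while `compact_clusters` is a simple greedy stand-in that only accepts a map into a group if it is close to every member. Both function names, the data and the threshold are assumptions made for the example.

```python
import numpy as np

def chain_clusters(D, t):
    """Connected components of the graph linking points with distance <= t."""
    n = len(D)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and D[i, j] <= t:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def compact_clusters(D, t):
    """Greedy grouping: join a group only if within t of all its members."""
    groups = []
    labels = []
    for i in range(len(D)):
        for g, members in enumerate(groups):
            if all(D[i, m] <= t for m in members):
                members.append(i)
                labels.append(g)
                break
        else:
            groups.append([i])
            labels.append(len(groups) - 1)
    return labels

# Stand-in data (assumption): 11 uniform heat maps with intensities 0.0 .. 1.0,
# each flattened to 120 values.
levels = np.arange(11) / 10
X = np.repeat(levels[:, None], 120, axis=1)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

t = 1.2  # just above the distance between neighboring maps (about 1.095)
chain_labels = chain_clusters(D, t)
compact_labels = compact_clusters(D, t)
print("chain:  ", chain_labels)
print("compact:", compact_labels)
print("distance between first and last map:", D[0, -1])
```

The chained version puts all eleven maps into one cluster even though the first and last are far apart, whereas the compact version splits the chain into small groups whose members are all mutually close. Which behavior is "right" depends on how you answered the questions above.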

(Choosing a clustering method can also depend on its complexity, on its performance on large amounts of data, on whether it is parallelizable, on whether it can give you hierarchical clusters, on whether its results allow easy classification of new heat maps, and more.)

