
How can I efficiently plot a distance matrix using seaborn?

I have a dataset of roughly 11,000 records with 4 features, all of them discrete or continuous. I perform clustering using K-means, then add a "cluster" column to the dataframe using kmeans.labels_. Now I want to plot the distance matrix, so I used pdist from scipy, but the matrix is not plotted.

Here is my code.

from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
import gc

# distance matrix
def distance_matrix(df_labeled, metric="euclidean"):
    df_labeled.sort_values(by=['cluster'], inplace=True)
    dist = pdist(df_labeled, metric)
    dist = squareform(dist)    
    sns.heatmap(dist, cmap="mako")
    print(dist)
    del dist
    gc.collect()

distance_matrix(finalDf)

Output:

[[ 0.          2.71373462  3.84599479 ...  7.59910903  8.10265588
   8.27195104]
 [ 2.71373462  0.          2.94410672 ...  7.90444283  8.28225031
   8.48094661]
 [ 3.84599479  2.94410672  0.         ...  9.78706347 10.42014451
  10.61261498]
 ...
 [ 7.59910903  7.90444283  9.78706347 ...  0.          1.27795469
   1.44711258]
 [ 8.10265588  8.28225031 10.42014451 ...  1.27795469  0.
   0.52333107]
 [ 8.27195104  8.48094661 10.61261498 ...  1.44711258  0.52333107
   0.        ]]

I get the following graph:
[image]

As you can see, the plot is empty. I also have to free up some RAM because Google Colab crashes.

How can I solve the problem?

It is possible that the issue is the size of the distance matrix. 11,000 records produce an 11,000 x 11,000 distance matrix with 121,000,000 elements (roughly 1 GB as 64-bit floats), which may be too large to plot effectively. One solution is to reduce the number of records in the dataframe before computing the distance matrix, for example by sampling rows.
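A minimal sketch of the sampling idea, assuming a dataframe with four numeric feature columns plus a "cluster" column. The helper name `sampled_distance_matrix`, the synthetic `demo` frame, and the choice to leave the cluster label out of the distance computation are all illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def sampled_distance_matrix(df, n_samples=1000, metric="euclidean", seed=0):
    """Return a square distance matrix for a random subset of rows."""
    # Cap the sample size at the number of available rows.
    n = min(n_samples, len(df))
    # Sort by cluster so rows of the same cluster sit next to each other.
    sample = df.sample(n=n, random_state=seed).sort_values(by="cluster")
    # Assumption: the cluster label is excluded from the distances.
    features = sample.drop(columns="cluster")
    return squareform(pdist(features, metric))


# Synthetic stand-in for the asker's dataframe: 4 features plus a label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(11_000, 4)), columns=list("abcd"))
demo["cluster"] = rng.integers(0, 3, size=len(demo))

dist = sampled_distance_matrix(demo, n_samples=1000)
print(dist.shape, f"{dist.nbytes / 2**20:.1f} MiB")
```

The resulting array can be passed to sns.heatmap in place of the full matrix. A 1,000-row sample needs a few MiB instead of roughly a gigabyte.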

Another potential issue could be the metric being used. Some metrics, such as Euclidean distance, may not be meaningful for discrete or categorical features. It may be worth trying a different metric, such as Manhattan distance (metric="cityblock" in pdist), which may suit these types of features better.
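To see how the metric changes the numbers, here is a small sketch on three made-up points. The values are illustrative only; "cityblock" and "hamming" are metric names that pdist accepts.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy points with two features each (illustrative values only).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [3.0, 0.0]])

euclid = squareform(pdist(X, metric="euclidean"))
manhattan = squareform(pdist(X, metric="cityblock"))  # SciPy's name for Manhattan
hamming = squareform(pdist(X, metric="hamming"))      # fraction of differing coordinates

# Distance between the first two points under each metric.
print(euclid[0, 1], manhattan[0, 1], hamming[0, 1])
```

Hamming distance only counts whether coordinates differ, so it tends to be a candidate for purely categorical columns rather than continuous ones.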

It is also possible that there are other issues with the code, such as incorrect imports or syntax errors. It may be worth double-checking the code and ensuring that all necessary libraries are imported and properly referenced.

The original question was well-phrased but was not a reprex. Its code, at least the part we can see, appears to work fine.

Here is a demo of producing a heatmap for another dataset that also has 11K rows.

from scipy.spatial.distance import pdist, squareform
from uszipcode import SearchEngine
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def distance_matrix(df: pd.DataFrame, metric="euclidean"):
    df = df[["zipcode", "lat", "lng", "population_density"]]
    df = df.sort_values(by=["zipcode"])
    print(df)
    dist = pdist(df, metric)
    dist = squareform(dist)
    sns.heatmap(dist, cmap="mako")
    print(dist)
    plt.show()


def get_df() -> pd.DataFrame:
    zips = SearchEngine().by_population_density(lower=100, returns=11_000)
    df = pd.DataFrame(z.to_dict() for z in zips)
    df["zipcode"] = df.zipcode.astype(int)
    return df


distance_matrix(get_df())

It consumes at least ten GiB under macOS 12.6.2, using CPython 3.10.8, matplotlib 3.6.2, scipy 1.9.3, seaborn 0.12.1.

It displays this: [heatmap image]

To plot a distance matrix using seaborn, you can use the seaborn.heatmap() function.

It is possible that the distance matrix is too large to be plotted using a heatmap. The sns.heatmap function expects a 2D array, but the output of pdist is a condensed 1D array of distances between all pairs of points in the input. When you use squareform to convert this array to a 2D distance matrix, the resulting matrix (11,000 x 11,000 here) may have too many rows and columns to be plotted effectively.
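The relationship between the two shapes can be checked on a tiny example. This sketch uses random stand-in data with 5 points; for n points, pdist returns n(n-1)/2 values and squareform expands them to n x n.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 points, 4 features (stand-in data)

condensed = pdist(X)            # 1D: one entry per unordered pair of points
square = squareform(condensed)  # 2D: symmetric matrix with a zero diagonal

print(condensed.shape, square.shape)
```

For 11,000 points the condensed form holds 60,494,500 distances, and the square form roughly doubles that.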

One solution to this problem is to downsample the distance matrix by selecting a subset of the rows and columns to plot. Because squareform returns a NumPy array (not a DataFrame), you can do this with ordinary array slicing rather than iloc:

sns.heatmap(dist[:100, :100], cmap="mako")

This will plot a 100x100 submatrix of the distance matrix, which may be small enough to be plotted effectively. You can adjust the size of the submatrix as needed to find a balance between detail and legibility.
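If the rows are sorted by cluster, a leading corner of the matrix may cover only one cluster. Two alternatives that span the whole matrix are sketched below, assuming a square NumPy array. The helper names `stride_downsample` and `block_mean` are hypothetical, not library functions.

```python
import numpy as np


def stride_downsample(dist, step):
    """Keep every `step`-th row and column of a square matrix."""
    return dist[::step, ::step]


def block_mean(dist, block):
    """Average non-overlapping block x block tiles, trimming any remainder."""
    n = (dist.shape[0] // block) * block
    trimmed = dist[:n, :n]
    k = n // block
    # Split each axis into (tile index, offset within tile), then average offsets.
    return trimmed.reshape(k, block, k, block).mean(axis=(1, 3))


# Small stand-in matrix so the shapes are easy to follow.
demo = np.arange(36, dtype=float).reshape(6, 6)
print(stride_downsample(demo, 2).shape, block_mean(demo, 3).shape)
```

Either result can be handed to sns.heatmap. Striding is cheaper, while block averaging smooths out individual outliers.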

Alternatively, you can try using a different plot type to visualize the distance matrix. For example, you could use a scatterplot to plot the distance between each pair of points, or you could use a dendrogram to show the hierarchical structure of the distances.
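A rough sketch of the dendrogram option using scipy.cluster.hierarchy, on 200 random stand-in points rather than the full dataset. The "average" linkage method is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in for a sample of the real data

condensed = pdist(X, metric="euclidean")
Z = linkage(condensed, method="average")  # linkage accepts the condensed form

# no_plot=True only computes the layout; drop it to draw with matplotlib.
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]  # row order that places similar points together
print(len(leaf_order))
```

The leaf order could also be used to reorder the rows and columns of a square distance matrix before drawing a heatmap.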
