
Using sklearn pairwise_distances to compute distance correlation between X and y

I am currently trying various methods to find the strength of the relationship between the variables in X and the dependent variable in y: 1. Correlation. 2. Mutual information. 3. Distance correlation. Correlation is the fastest and simplest (1 hour on a sample of 3 million records and 560 variables). The mutual information calculation takes approximately 16 hours. I am also looking at distance correlation because of its interesting property: the distance correlation between Xi and Y is zero if and only if they are independent. However, I am facing a problem while doing the calculation in Python.

Below is my data:

**X**

prop_tenure prop_12m    prop_6m prop_3m 
0.04        0.04        0.06    0.08
0           0           0       0
0           0           0       0
0.06        0.06        0.1     0
0.38        0.38        0.25    0
0.61        0.61        0.66    0.61
0.01        0.01        0.02    0.02
0.1         0.1         0.12    0.16
0.04        0.04        0.04    0.09
0.22        0.22        0.22    0.22
0.72        0.72        0.73    0.72
0.39        0.39        0.45    0.64

**y**
status
0
0
1
1
0
0
0
1
0
0
0
1

I want to compute the distance correlation of each variable in X with y and store it in a dataframe, so I am doing the following:

from sklearn.metrics.pairwise import pairwise_distances

num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)

However, the documentation says the following:

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Does this require an equal number of features in both X and Y?

How can I get the distance correlation between each Xi and y in Python? Can someone please help me with this?

Update:

I tried repeating the column of y X.shape[1] times and then doing the calculation, but it gives a memory error for a sample of 10k records:

X = data_col.values

lb = preprocessing.LabelBinarizer()
df_target['drform'] = lb.fit_transform(df_target['status'])

y = df_target.values
n_rep = X.shape[1]
y = np.repeat(y,n_rep,axis = 1)

num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)

Traceback (most recent call last):

  File "<ipython-input-30-0f28f4b76a7e>", line 20, in <module>
    num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)

  File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\sklearn\metrics\pairwise.py", line 1247, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)

  File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\sklearn\metrics\pairwise.py", line 1090, in _parallel_pairwise
    return func(X, Y, **kwds)

  File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\scipy\spatial\distance.py", line 2381, in cdist
    dm = np.empty((mA, mB), dtype=np.double)

MemoryError

You can use scipy for this. Although it is not explicitly parallelised, it is heavily optimised and vectorised, and I find that it runs very quickly on big datasets.

from scipy.spatial.distance import cdist
import numpy as np

n_samples = 100000
n_features = 50

X = np.random.random((n_samples, n_features))
y = np.random.choice([0, 1], size=(n_samples, 1))
correlations = cdist(X.T, y.T, metric='correlation')

Note that this returns a correlation distance (one value per column of X, because the inputs are transposed so that features become rows). cdist also supports many other built-in metrics as well as custom metric functions. More details are on the docs page.
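To make the output of the snippet above concrete, here is a small sketch (sizes and seed are arbitrary choices for illustration, not from the original answer). It relies on scipy's correlation distance being 1 minus the Pearson correlation, so the Pearson r of each column with y can be recovered from it. It also shows why the transposes matter: without them, rows (samples) are compared with rows, and the result is an n_samples x n_samples matrix, which is most likely what triggered the MemoryError in the question.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative sizes only; the question's data is far larger.
rng = np.random.default_rng(0)
n_samples, n_features = 1000, 4
X = rng.random((n_samples, n_features))
y = rng.integers(0, 2, size=(n_samples, 1)).astype(float)

# Rows of X.T are features, so the result has shape (n_features, 1):
# one correlation distance per column of X.
corr_dist = cdist(X.T, y.T, metric="correlation")

# Correlation distance is 1 - Pearson r, so r is recovered directly.
pearson_r = 1.0 - corr_dist.ravel()

# Cross-check against NumPy's Pearson correlation, column by column.
expected = np.array(
    [np.corrcoef(X[:, j], y[:, 0])[0, 1] for j in range(n_features)]
)
assert np.allclose(pearson_r, expected)
```

A value near 1 in `corr_dist` therefore means "almost no linear correlation", not "strong dependence".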

Are you sure that you have computed what you wanted? It seems that scipy computes a distance based on (Pearson) correlation with this method. Maybe you wanted Székely's distance correlation, as in https://pypi.org/project/dcor/ .
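For reference, a minimal NumPy sketch of the sample (biased, V-statistic) distance correlation for two 1-D variables, applied column by column to the small table from the question. The function names `_centered_distances` and `distance_correlation` are hypothetical helpers written for this illustration, not part of scipy, sklearn, or dcor. The approach builds an n x n distance matrix per variable, so it is only practical on a subsample of a dataset with millions of rows; the dcor package linked above is likely the better choice for real work.

```python
import numpy as np

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation (biased estimator) between two 1-D arrays."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()        # squared sample distance covariance
    dvar_x = (a * a).mean()       # squared sample distance variance of x
    dvar_y = (b * b).mean()       # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:                # a constant variable: defined as 0
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# The 12-row sample from the question.
columns = ["prop_tenure", "prop_12m", "prop_6m", "prop_3m"]
X = np.array([
    [0.04, 0.04, 0.06, 0.08], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.06, 0.06, 0.10, 0.00],
    [0.38, 0.38, 0.25, 0.00], [0.61, 0.61, 0.66, 0.61],
    [0.01, 0.01, 0.02, 0.02], [0.10, 0.10, 0.12, 0.16],
    [0.04, 0.04, 0.04, 0.09], [0.22, 0.22, 0.22, 0.22],
    [0.72, 0.72, 0.73, 0.72], [0.39, 0.39, 0.45, 0.64],
])
y = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])

# One distance correlation per column of X against y.
dcor_by_column = {
    name: distance_correlation(X[:, j], y) for j, name in enumerate(columns)
}
print(dcor_by_column)
```

The resulting dict can be turned into a dataframe column, for example with `pandas.Series(dcor_by_column)`, which matches what the question was trying to store.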
