Calculating pairwise Euclidean distance between all the rows of a dataframe

How can I calculate the Euclidean distance between all the rows of a dataframe? I am trying this code, but it is not working:

zero_data = data
distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
result.head()

This is what my (44062 by 278) dataframe looks like:

See the sample data here.

To compute the Euclidean distance between two rows i and j of a dataframe df:

np.linalg.norm(df.loc[i] - df.loc[j])

To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...

np.linalg.norm(df.diff(axis=0).drop(0), axis=1)

If you want to compute it between all the rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., then you have to loop through all the combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so using a for-loop will be very slow):

import itertools
import numpy as np

for i, j in itertools.combinations(df.index, 2):
    # Euclidean distance between rows i and j
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])

Edit:

Instead, you can use scipy.spatial.distance.cdist, which computes the distance between each pair of two collections of inputs:

from scipy.spatial.distance import cdist

cdist(df, df, 'euclidean')

This will return a symmetric (44062 by 44062) matrix of the Euclidean distances between all the rows of your dataframe. The problem is that you need a lot of memory for it to work (at least 8*44062**2 bytes, i.e. ~16GB). So a better option is to use pdist:

from scipy.spatial.distance import pdist

pdist(df.values, 'euclidean')

which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
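
If you need the distances in square form, scipy.spatial.distance.squareform can convert the condensed vector returned by pdist into the full symmetric matrix. A minimal sketch (note that materializing the full matrix for all 44062 rows runs into the same ~16GB memory problem as cdist):

from scipy.spatial.distance import pdist, squareform

# condensed vector of length n*(n-1)/2
condensed = pdist(df.values, 'euclidean')

# full symmetric n-by-n matrix (only feasible for a modest number of rows)
square = squareform(condensed)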

PS: Don't forget to ignore the 'Actual_Data' column in the computation of the distances. E.g. you can do the following: data = df.drop('Actual_Data', axis=1).values and then cdist(data, data, 'euclidean') or pdist(data, 'euclidean'). You can also create another dataframe with the distances like this:

data = df.drop('Actual_Data', axis=1).values

d = pd.DataFrame(itertools.combinations(df.index, 2), columns=['i','j'])
d['dist'] = pdist(data, 'euclidean')


   i  j  dist
0  0  1  ...
1  0  2  ...
2  0  3  ...
3  0  4  ...
...
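
With that long-format table you can, for example, pull out the closest pair of rows. A small usage sketch, assuming d was built as above:

# row of d with the smallest pairwise distance
closest = d.loc[d['dist'].idxmin()]
print(closest['i'], closest['j'], closest['dist'])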

Working with a subset of your data, for example:

import numpy as np
import pandas as pd

# A small subset of the original data
df_data = [[888888, 3, 0, 0],
 [677767, 0, 2, 1],
 [212341212, 0, 0, 0],
 [141414141414, 0, 0, 0],
 [1112224, 0, 0, 0]]

# Creating the dataframe
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)

# Which looks like
#     Actual_Data  8,8  6,6  7,7
# 0  8.888880e+05  3.0  0.0  0.0
# 1  6.777670e+05  0.0  2.0  1.0
# 2  2.123412e+08  0.0  0.0  0.0
# 3  1.414141e+11  0.0  0.0  0.0
# 4  1.112224e+06  0.0  0.0  0.0

# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

# Which looks like
# 0     [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1    [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2                 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3        [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4      [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]

# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
  data=dist_matrix.values.tolist(), 
  columns=df.index.tolist(), 
  index=df.index.tolist())

# Which gives you
#               0             1             2             3             4
# 0  0.000000e+00  2.111210e+05  2.114523e+08  1.414133e+11  2.233360e+05
# 1  2.111210e+05  0.000000e+00  2.116634e+08  1.414135e+11  4.344570e+05
# 2  2.114523e+08  2.116634e+08  0.000000e+00  1.412018e+11  2.112290e+08
# 3  1.414133e+11  1.414135e+11  1.412018e+11  0.000000e+00  1.414130e+11
# 4  2.233360e+05  4.344570e+05  2.112290e+08  1.414130e+11  0.000000e+00
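
As a sanity check (and a much faster way to get the same matrix), the row-by-row apply above can be replaced by scipy.spatial.distance.cdist. A minimal sketch under the same assumptions:

from scipy.spatial.distance import cdist

# Vectorized pairwise distances between all rows of df
dist_matrix_fast = pd.DataFrame(
  data=cdist(df.values, df.values, 'euclidean'),
  columns=df.index.tolist(),
  index=df.index.tolist())

# The two approaches should agree up to floating-point error
assert np.allclose(dist_matrix.values, dist_matrix_fast.values)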

Update

As pointed out in the comments, the issue is a memory overflow, so we have to work through the problem in batches.

# Collecting the data
# df = ....

# Set this number to a higher value if you still get the same `memory` errors.
batch = 200 # number of chunks the rows are split into

# To be conservative, let's write the intermediate results to files.
dffname = []

for ifile,_slice in enumerate(np.array_split(range(df.shape[0]), batch)):

  # Compute distances for this chunk of rows against every row of the dataframe
  tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

  tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)

  # You can change this from CSV to any other file format
  tmp_df.to_csv(f"{ifile+1}.csv")
  dffname.append(f"{ifile+1}.csv")

# Reading back the DataFrames
dflist = []
for f in dffname:
  dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))

res = pd.concat(dflist)
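
If the per-row apply inside each batch is still too slow, each chunk can also be computed with cdist instead. A hedged sketch of the same batching loop, under the same assumptions about df and batch as above (the reading-back and pd.concat steps stay unchanged):

from scipy.spatial.distance import cdist

dffname = []
for ifile, _slice in enumerate(np.array_split(range(df.shape[0]), batch)):

  # Distances from this chunk of rows to every row of the dataframe
  chunk = cdist(df.values[_slice], df.values, 'euclidean')

  tmp_df = pd.DataFrame(chunk, index=df.index.values[_slice], columns=df.index.values)
  tmp_df.to_csv(f"{ifile+1}.csv")
  dffname.append(f"{ifile+1}.csv")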
