How to compute a pairwise matrix from all pandas columns

Question

Suppose I have a DataFrame:

import pandas as pd

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

The data looks like this:

    A   B   C
0   11  10  13
1   16  15  45
2   35  14  9

The real data has a hundred columns and a thousand rows.

I have a function that counts how many values in one column are higher than the minimum value of another column. The function looks like this:

def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows

An example call to the function looks like this:

get_count_higher_than_min(df, 'A', df['B'])

The output is 3, because the minimum value of df['B'] is 10 and all three values in df['A'] are higher than 10.
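For a single pair of columns, the same count can be obtained without a row-wise apply. The following is a minimal sketch, assuming numeric columns; the helper name count_higher_than_min is hypothetical and not part of the original post.

```python
import pandas as pd

def count_higher_than_min(values, reference):
    # Hypothetical helper (not from the original post).
    # Series.min() skips NaN by default, and any comparison with NaN is False,
    # so missing values are never counted.
    return int((values > reference.min()).sum())

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=['A', 'B', 'C'])

# Count the values in A that are higher than the minimum of B
result = count_higher_than_min(df['A'], df['B'])
print(result)
```

This only covers one pair at a time; the pairwise question below still stands.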

The problem is that I want to apply this function to every pair of columns.

I don't know an effective and efficient way to do this. I want the output in a form similar to a confusion matrix or a correlation matrix.

Example output:

    A   B   C
A   X  3  X
B   X  X  X
C   X  X  X

Answer 1

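from itertools import product

# all ordered (row, column) pairs of column names, including each column with itself
pairs = product(df.columns, repeat=2)

min_value = {}  # cache of column minima
output = []

for row_col, min_col in pairs:
    # compute each column's minimum only once
    # (dict.get(key, default) would evaluate the default on every call)
    if min_col not in min_value:
        min_value[min_col] = df[min_col].min()
    min_ = min_value[min_col]

    # number of values in row_col that are higher than the minimum of min_col
    count = (df[row_col] > min_).sum()
    output.append(count)

# reshape the flat list of counts into an n x n table
n = len(df.columns)
df_desired = pd.DataFrame(
    [output[i: i + n] for i in range(0, len(output), n)],
    columns=df.columns, index=df.columns)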

print(df_desired)

Answer 2

This is O(n² m), where n is the number of columns and m is the number of rows.

minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})

Result:

>>> m
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2
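Each row of m is the column whose values are counted, and each column of m is the column that supplies the minimum, so m.loc['A', 'B'] is 3 as in the question. The same table can also be built with a single broadcast comparison. This is a sketch only, assuming the data is numeric with no missing values and that an m × n × n boolean array fits in memory; the names values, compare and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=['A', 'B', 'C'])

values = df.to_numpy()        # shape (rows, columns)
minima = values.min(axis=0)   # one minimum per column

# compare[r, i, j] is True when row r of column i is higher than the minimum of column j
compare = values[:, :, None] > minima[None, None, :]

# summing over the row axis leaves a columns x columns table of counts
counts = pd.DataFrame(compare.sum(axis=0), index=df.columns, columns=df.columns)
print(counts)
```

The work is still proportional to n² m; the broadcast only moves the loop over columns out of Python.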

In theory, O(n log(n) m) is possible.
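One way to avoid comparing every value against every minimum is to sort each column once and then locate each minimum with a binary search. The sketch below is not necessarily the approach this answer has in mind, and its cost is roughly O(n m log m + n² log m) rather than the bound stated above. It assumes numeric data with no missing values; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=['A', 'B', 'C'])

values = df.to_numpy()
n_rows = values.shape[0]

# sort every column independently; the first row then holds each column's minimum
sorted_cols = np.sort(values, axis=0)
minima = sorted_cols[0]

rows = []
for i in range(sorted_cols.shape[1]):
    # for each minimum, the number of values in column i that are <= that minimum
    not_higher = np.searchsorted(sorted_cols[:, i], minima, side='right')
    rows.append(n_rows - not_higher)

counts = pd.DataFrame(np.vstack(rows), index=df.columns, columns=df.columns)
print(counts)
```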
