
How to compute pairwise matrix from all pandas columns

Consider the following DataFrame:

import pandas as pd

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

The data looks like this:

    A   B   C
0   11  10  13
1   16  15  45
2   35  14  9

The real data consists of a hundred columns and a thousand rows.

I have a function that counts how many values in one column are higher than the minimum value of another column. The function looks like this:

def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows

An example call looks like this:

get_count_higher_than_min(df, 'A', df['B'])

The output is 3. That is because the minimum value of df['B'] is 10 and all three values in df['A'] are higher than 10.
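The same count can be obtained without a row-wise `apply`, which tends to be slow on large frames. The following is a minimal sketch, assuming the comparison is strictly "greater than" as in the function above; the name `count_higher_than_min` is hypothetical and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=["A", "B", "C"])

def count_higher_than_min(df, column_name, other):
    """Count values in df[column_name] strictly greater than min(other)."""
    # Compute the threshold once instead of once per row.
    threshold = other.min(skipna=True)
    # A boolean Series sums to the number of True entries.
    return int((df[column_name] > threshold).sum())

count = count_higher_than_min(df, "A", df["B"])
print(count)  # 3
```

Computing the minimum once outside the comparison avoids re-evaluating it for every row, which is what the lambda in the original function does.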

The problem is that I want to apply this function pairwise to all columns.

I don't know an effective and efficient way to do this. I want the output in a form similar to a confusion matrix or a correlation matrix.

Example output:

    A   B   C
A   X   3   X
B   X   X   X
C   X   X   X

import pandas as pd
from itertools import product

# every ordered (counted column, minimum column) pair of column names
pairs = product(df.columns, repeat=2)

min_value = {}  # cache of per-column minima
output = []

for counted_col, min_col in pairs:
    # compute each column's minimum only once
    if min_col not in min_value:
        min_value[min_col] = df[min_col].min()
    min_ = min_value[min_col]

    # count the values in counted_col that exceed the minimum of min_col
    count = (df[counted_col] > min_).sum()
    output.append(count)

# reshape the flat list of counts into a square matrix
n_cols = len(df.columns)
df_desired = pd.DataFrame(
    [output[i: i + n_cols] for i in range(0, len(output), n_cols)],
    columns=df.columns, index=df.columns)

print(df_desired)
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2

This is O(n²m), where n is the number of columns and m is the number of rows.

minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})

Result:

>>> m
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2
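The same matrix can also be built with a single NumPy broadcast instead of a Python-level loop over columns. This is only a sketch under the assumption that the whole frame is numeric and that an intermediate boolean array of shape (rows, columns, columns) fits in memory (about 10 MB for 1000 rows and 100 columns); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=["A", "B", "C"])

values = df.to_numpy()               # shape (m, n)
minima = np.nanmin(values, axis=0)   # per-column minimum, ignoring NaN

# values[:, :, None] has shape (m, n, 1); minima[None, None, :] has shape (1, 1, n).
# Broadcasting compares every value of column i with the minimum of column j.
counts = (values[:, :, None] > minima[None, None, :]).sum(axis=0)

# Rows are the columns being counted, columns are the columns supplying the minimum.
pairwise = pd.DataFrame(counts, index=df.columns, columns=df.columns)
print(pairwise)
```

This is still O(n²m) work; the difference is that the loop runs inside NumPy rather than in Python.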

In theory, O(n log(n) m) is possible.
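The answer does not say how that bound would be reached. One possible approach, sketched here as an assumption rather than as the answerer's method, is to sort the n column minima once and then binary-search every value against them. The function name `pairwise_counts_sorted` is hypothetical, and the sketch assumes the data contains no NaN.

```python
import numpy as np
import pandas as pd

def pairwise_counts_sorted(df):
    """Counts of values in each column exceeding each column's minimum."""
    values = df.to_numpy()                      # shape (m, n)
    n = values.shape[1]
    minima = values.min(axis=0)
    order = np.argsort(minima, kind="stable")   # O(n log n)
    sorted_minima = minima[order]

    # rank = number of minima strictly smaller than each value, O(m n log n)
    ranks = np.searchsorted(sorted_minima, values, side="left")

    result = np.empty((n, n), dtype=np.int64)
    for i in range(n):
        # hist[r] = how many values of column i have rank r (r in 0..n)
        hist = np.bincount(ranks[:, i], minlength=n + 1)
        # at_least[r] = how many values have rank >= r
        at_least = hist[::-1].cumsum()[::-1]
        # a value exceeds the k-th smallest minimum iff its rank >= k + 1
        result[i, order] = at_least[1:]
    return pd.DataFrame(result, index=df.columns, columns=df.columns)

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]],
                  columns=["A", "B", "C"])
counts_sorted = pairwise_counts_sorted(df)
print(counts_sorted)
```

For a hundred columns the constant factors probably matter more than the asymptotic gain, so the simpler comparison-based versions may well be fast enough in practice.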
