How to compute a pairwise matrix from all pandas columns
Consider the following dataframe:
import pandas as pd

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df
The data looks like this:
    A   B   C
0  11  10  13
1  16  15  45
2  35  14   9
The real data has about a hundred columns and a thousand rows.
I have a function that counts how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
An example call to the function:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3, because the minimum value of df['B'] is 10 and all three values in df['A'] are higher than 10.
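For reference, the same count can probably be obtained without a row-wise apply, by comparing the whole column against the minimum at once. This is only a minimal sketch; the name count_higher_than_min is hypothetical and not part of the original function:

```python
import pandas as pd


def count_higher_than_min(df, column_name, reference):
    """Count the values in df[column_name] that exceed the minimum of `reference`."""
    threshold = reference.min()  # skipna=True is the pandas default
    # The comparison yields a boolean Series; summing it counts the True entries.
    return int((df[column_name] > threshold).sum())


data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=["A", "B", "C"])

count_ab = count_higher_than_min(df, "A", df["B"])
print(count_ab)
```

For the sample data this should give the same value as get_count_higher_than_min(df, 'A', df['B']).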
The problem is that I want to apply this function pairwise to all columns, and I don't know an effective and efficient way to do that. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
   A  B  C
A  X  3  X
B  X  X  X
C  X  X  X
from itertools import product

pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for row_col, min_col in pairs:
    # compute each column's minimum only once and cache it
    if min_col not in min_value:
        min_value[min_col] = df[min_col].min()
    min_ = min_value[min_col]
    # count the values of row_col that exceed the minimum of min_col
    count = (df[row_col] > min_).sum()
    output.append(count)
n_cols = len(df.columns)
df_desired = pd.DataFrame(
    [output[i: i + n_cols] for i in range(0, len(output), n_cols)],
    columns=df.columns, index=df.columns)
print(df_desired)
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2
This is O(n^2 m), where n is the number of columns and m is the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
   A  B  C
A  2  3  3
B  2  2  3
C  2  2  2
In theory, O(n log(n) m) is possible.
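One way such a bound might be reached is sketched below; this is an assumption about what is meant, not a method given above. The idea is to sort the n column minima once and then binary-search every value against them, which costs roughly O(n m log(n)) for the searches plus O(n^2) to fill the result matrix. The function name pairwise_counts_above_min is hypothetical, and the sketch assumes all columns are numeric and contain no NaN values.

```python
import numpy as np
import pandas as pd


def pairwise_counts_above_min(df):
    """counts.loc[r, c] = number of values in column r above the minimum of column c.

    Sketch only: assumes numeric columns without NaN values.
    """
    cols = df.columns
    n = len(cols)
    minima = df.min().to_numpy()
    order = np.argsort(minima, kind="stable")  # column positions, by ascending minimum
    sorted_minima = minima[order]

    values = df.to_numpy()
    # ranks[i, j] = how many column minima are strictly below values[i, j]
    ranks = np.searchsorted(sorted_minima, values, side="left")

    counts = np.empty((n, n), dtype=np.int64)
    for j in range(n):
        hist = np.bincount(ranks[:, j], minlength=n + 1)
        # at_least[r] = number of values in column j whose rank is >= r
        at_least = hist[::-1].cumsum()[::-1]
        # a value exceeds the k-th smallest minimum exactly when its rank is > k
        counts[j, order] = at_least[1:]
    return pd.DataFrame(counts, index=cols, columns=cols)


data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=["A", "B", "C"])
result = pairwise_counts_above_min(df)
print(result)
```

On the sample dataframe this should reproduce the matrix shown above. For a hundred columns and a thousand rows the simpler O(n^2 m) versions are likely fast enough, so this variant is mainly of interest when the number of columns is large.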