How can I find the frequency count of repeated combinations in a DataFrame?
I have this data set as a sample:
import pandas as pd

df = pd.DataFrame({'CL1':['A B C','C A N']},
                  columns=['CL1','CL2','CL3','CL4'])
     CL1  CL2  CL3  CL4
0  A B C  NaN  NaN  NaN
1  C A N  NaN  NaN  NaN
Separate each value using (,) as the delimiter and add the result in column CL2:

       CL1      CL2  CL3  CL4
0  'A B C'  'A,B,C'  NaN  NaN
1  'C A N'  'C,A,N'  NaN  NaN
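A minimal sketch of this step, assuming the values in CL1 are always separated by single spaces. Using Series.str.replace here is my own choice for illustration, not something stated in the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'CL1': ['A B C', 'C A N']},
                  columns=['CL1', 'CL2', 'CL3', 'CL4'])

# Replace every space with a comma (literal match, not a regex)
df['CL2'] = df['CL1'].str.replace(' ', ',', regex=False)

print(df[['CL1', 'CL2']])
```

If the spacing is irregular, splitting with str.split() and joining with ',' would likely be the safer variant.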
Separate the values of column CL2 in column CL3:

       CL1      CL2          CL3  CL4
0  'A B C'  'A,B,C'  'A','B','C'  NaN
1  'C A N'  'C,A,N'  'C','A','N'  NaN
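One way this step could look, assuming CL2 already holds the comma-joined strings shown above. Each cell of CL3 then holds a Python list of the separate values:

```python
import pandas as pd

# CL1 and CL2 as they appear in the table above
df = pd.DataFrame({'CL1': ['A B C', 'C A N'],
                   'CL2': ['A,B,C', 'C,A,N']})

# Split each comma-joined string into a list of its values
df['CL3'] = df['CL2'].str.split(',')

print(df)
```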
Find the union (in the set-theory sense used in statistics) of these values in column CL4:

       CL1      CL2          CL3  CL4
0  'A B C'  'A,B,C'  'A','B','C'  [ [A],[B],[C],[A,B],[A,C],[B,C],[A,B,C] ]
1  'C A N'  'C,A,N'  'C','A','N'  [ [C],[A],[N],[A,C],[C,N],[A,N],[C,A,N] ]
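Judging by the table, CL4 holds every non-empty combination of the values in CL3. A rough sketch of that step with itertools.combinations follows; the helper name non_empty_subsets is hypothetical and only used for illustration:

```python
from itertools import combinations

import pandas as pd


def non_empty_subsets(items):
    """Return every non-empty combination of items, shortest first."""
    return [list(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


# CL3 as lists, matching the table above
df = pd.DataFrame({'CL3': [['A', 'B', 'C'], ['C', 'A', 'N']]})

# Build the list of combinations for each row
df['CL4'] = df['CL3'].apply(non_empty_subsets)

print(df['CL4'].iloc[0])
```

Note that combinations keeps the input order, so the second row yields [C, A] rather than the [A,C] written in the table.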
Find the number of repetitions of each value of column CL4, put the values in a new column CL5 of a new data frame, and add the number to Count:

     CL5  Count
0    [A]      2
1    [B]      1
2    [C]      2
3    [D]      1
4    [N]      1
5  [A,B]      1
etc.
You can split the values on spaces, then call a custom function to build all combinations; for the counts, use Series.explode with Series.value_counts:
import pandas as pd
from itertools import chain, combinations

df = pd.DataFrame({'CL1':['A B C','C A N','D E F','F X G']},
                  columns=['CL1','CL2','CL3','CL4'])

#https://stackoverflow.com/a/5898031/2901002
def all_subsets(ss):
    # every non-empty combination of ss, from length 1 up to len(ss)
    return chain(*map(lambda x: combinations(ss, x), range(1, len(ss)+1)))

# split on whitespace, build all combinations, then flatten and count them
df = (df['CL1'].apply(lambda x: list(all_subsets(x.split())))
               .explode()
               .value_counts()
               .rename_axis('CL5')
               .reset_index(name='count'))
print (df.head(10))
CL5 count
0 (C,) 2
1 (F,) 2
2 (A,) 2
3 (E, F) 1
4 (F, G) 1
5 (A, B) 1
6 (C, A) 1
7 (A, C) 1
8 (F, X, G) 1
9 (D,) 1
df['CL5'] = df['CL5'].apply(list)
print (df.head(10))
CL5 count
0 [C] 2
1 [F] 2
2 [A] 2
3 [E, F] 1
4 [F, G] 1
5 [A, B] 1
6 [C, A] 1
7 [A, C] 1
8 [F, X, G] 1
9 [D] 1
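In the output above, (C, A) and (A, C) are counted separately because combinations keeps the order of the values in each row. If the order should not matter, as the expected output in the question seems to suggest, one possible variant is to sort the values of each row before building the combinations. This is only a sketch under that assumption; the variable names s and counts are mine:

```python
from itertools import chain, combinations

import pandas as pd


def all_subsets(ss):
    # every non-empty combination of ss, from length 1 up to len(ss)
    return chain.from_iterable(combinations(ss, r) for r in range(1, len(ss) + 1))


s = pd.Series(['A B C', 'C A N', 'D E F', 'F X G'], name='CL1')

# Sorting each row first makes 'C A N' produce ('A', 'C') instead of ('C', 'A'),
# so the same pair from different rows falls into one bucket
counts = (s.apply(lambda x: list(all_subsets(sorted(x.split()))))
           .explode()
           .value_counts()
           .rename_axis('CL5')
           .reset_index(name='count'))

print(counts.head(10))
```

With the sample data, the pair (A, C) should now be counted twice, since it occurs in both of the first two rows.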