Quickest way to count occurrences of values in a multi-index pandas dataframe
I have two multi-index dataframes with many levels and columns. I'm looking for the quickest way to iterate over each dataframe and count, for each row, how many cells are above a specific value, and then find the intersection of the rows of the two dataframes that scored at least one count.

Right now I'm cycling through the dataframes with a combination of for loops and groupby, but it's taking too much time to find the right answer (my real dataframes comprise thousands of levels and hundreds of columns), so I need to find a different way to do this.

So for example:
idx = pd.MultiIndex.from_product([[0,1],[0,1,2]],
                                 names=['index_1','index_2'])
col = ['column_1', 'column_2']
values_list_a = [[1,2],[2,2],[2,1],[-8,1],[2,0],[2,1]]
DFA = pd.DataFrame(values_list_a, idx, col)
DFA:
                 column_1  column_2
index_1 index_2
0       0               1         2
        1               2         2
        2               2         1
1       0              -8         1
        1               2         0
        2               2         1
values_list_b = [[2,2],[0,1],[2,2],[2,2],[1,0],[1,2]]
DFB = pd.DataFrame(values_list_b, idx, col)

DFB:
                 column_1  column_2
index_1 index_2
0       0               2         2
        1               0         1
        2               2         2
1       0               2         2
        1               1         0
        2               1         2
What I expect is:

Step 1, counting occurrences:
DFA:
                 column_1  column_2  counts
index_1 index_2
0       0               1         2       1
        1               2         2       2
        2               2         1       1
1       0              -8         1       0
        1               2         0       1
        2               2         1       1
DFB:
                 column_1  column_2  counts
index_1 index_2
0       0               2         2       2
        1               0         1       0
        2               2         2       2
1       0               2         2       2
        1               1         0       0
        2               1         2       1
Step 2: the intersection of the two dataframes with counts > 0 should create a new dataframe like this (the rows of both dataframes that score at least one count at the same indices are kept, and a new index_0 level is added; index_0 = 0 should refer to DFA and index_0 = 1 to DFB):
DFC:
                         column_1  column_2  counts
index_0 index_1 index_2
0       0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
1       0       0               2         2       2
                2               2         2       2
        1       2               1         2       1
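For reference, the Step 1 counting needs no loop at all; a minimal vectorized sketch on the question's DFA, assuming the threshold is 1:

```python
import pandas as pd

# Rebuild the example DataFrame from the question
idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
values_list_a = [[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]]
DFA = pd.DataFrame(values_list_a, idx, col)

# Step 1: per-row count of cells strictly greater than the threshold
DFA['counts'] = DFA.gt(1).sum(axis=1)
print(DFA['counts'].tolist())  # [1, 2, 1, 0, 1, 1]
```

This single `gt`/`sum` pass replaces the per-row loop and scales to hundreds of columns.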
df.groupby(['index_0', 'index_1', 'index_2'])

Now you want the equivalent of SQL's HAVING, which is:

df.filter(lambda x: len(x.column_1) > 2)
df.count()

This is just a concept; I didn't understand exactly what you want to filter. Note that x is a group, so you need to operate on it (len, set, values, etc.).
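To make the HAVING-style idea above concrete, here is a runnable sketch on the question's DFA; the grouping level and the predicate (keep only groups in which every row has at least one value above the threshold) are illustrative choices, not something the question specified:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
df = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]],
                  idx, ['column_1', 'column_2'])

# SQL HAVING analogue: keep only the groups that satisfy a predicate.
# Here: every row of the group has at least one value > 1.
kept = (df.groupby(level='index_1')
          .filter(lambda g: (g > 1).any(axis=1).all()))
print(kept.index.tolist())  # [(0, 0), (0, 1), (0, 2)]
```

Group index_1=0 survives (each of its rows has a value above 1); group index_1=1 is dropped because the row (1, 0) = [-8, 1] fails the predicate.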
Use filter, .any() and pd.merge().

Recreate the dataframes:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0,1],[0,1,2]], names=['one', 'two'])
columns = ['columns_1', 'columns_2']
DFA = pd.DataFrame(np.random.randint(-1, 3, size=[6,2]), idx, columns)
DFB = pd.DataFrame(np.random.randint(-1, 3, size=[6,2]), idx, columns)
print(DFA)
columns_1 columns_2
one two
0 0 -1 2
1 2 -1
2 -1 0
1 0 1 2
1 0 0
2 -1 -1
print(DFB)
columns_1 columns_2
one two
0 0 2 -1
1 1 2
2 2 1
1 0 0 0
1 -1 2
2 1 -1
Filter the dataframes for values > 1, in this instance.
DFA = DFA.loc[(DFA > 1).any(axis=1), :]
DFB = DFB.loc[(DFB > 1).any(axis=1), :]
print(DFA)
columns_1 columns_2
one two
0 0 -1 2
1 2 -1
1 0 1 2
print(DFB)
columns_1 columns_2
one two
0 0 2 -1
1 1 2
2 2 1
1 1 -1 2
Merge the two together. Using an outer join gets you close:

DFA.merge(DFB, left_index=True, right_index=True, how='outer')

Not sure about adding the extra index_0 level, but in the result the _x columns come from DFA and the _y columns from DFB.
columns_1_x columns_2_x columns_1_y columns_2_y
one two
0 0 -1.0 2.0 2.0 -1.0
1 2.0 -1.0 1.0 2.0
1 0 1.0 2.0 NaN NaN
0 2 NaN NaN 2.0 1.0
1 1 NaN NaN -1.0 2.0
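To turn this into the intersection the question asks for in Step 2, an inner join keeps only the index pairs that survived the filter in both dataframes. A sketch using the question's fixed data (rather than the random data above) and threshold 1:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

# Keep rows with at least one value above the threshold
fa = DFA[(DFA > 1).any(axis=1)]
fb = DFB[(DFB > 1).any(axis=1)]

# Inner join on the index keeps only pairs present in both dataframes
both = fa.merge(fb, left_index=True, right_index=True,
                how='inner', suffixes=('_a', '_b'))
print(both.index.tolist())  # [(0, 0), (0, 2), (1, 2)]
```

The surviving index pairs match the expected DFC rows; what remains is reshaping the `_a`/`_b` columns back into the stacked `index_0` layout.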
pd.concat, then magic.
def f(d, thresh=1):
    c = d.gt(thresh).sum(1)
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    return d.assign(counts=c)[mask]

pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)

                         column_1  column_2  counts
index_0 index_1 index_2
bar     0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
foo     0       0               2         2       2
                2               2         2       2
        1       2               1         2       1
def f(d, thresh=1):
    # count how many values are greater than the threshold `thresh` per row
    c = d.gt(thresh).sum(1)
    # find where `counts` are > 0 for both dataframes;
    # conveniently dropped into one dataframe so we can do
    # this nifty `groupby` trick
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    #                                    \---------------/
    #             `transform` is key to broadcasting over the
    #             original index rather than collapsing over
    #             the index levels we grouped by
    # create a new column named `counts` and filter with the boolean mask
    return d.assign(counts=c)[mask]

# Use concat to smash the two dataframes together into one
pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)
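To get exactly the `index_0 = 0/1` labels from the question instead of `'bar'/'foo'`, the same pipeline can be keyed with integers; a self-contained run on the question's data:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

def f(d, thresh=1):
    # per-row count of values above the threshold
    c = d.gt(thresh).sum(axis=1)
    # keep (index_1, index_2) pairs whose count is > 0 under every index_0 key
    mask = c.gt(0).groupby(level=['index_1', 'index_2']).transform('all')
    return d.assign(counts=c)[mask]

DFC = pd.concat({0: DFA, 1: DFB}, names=['index_0']).pipe(f)
print(DFC['counts'].tolist())  # [1, 1, 1, 2, 2, 1]
```

The `counts` column matches the expected DFC exactly, with `index_0` as the new outer level.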