Filter multi-indexed dataframe based on lower dimensional values

I would like to filter a multi-indexed DataFrame based on another DataFrame with a lower dimensional index, as in the following example:

import io
import pandas as pd

df1 = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
1          1002      1        9
2          1001      1        3
2          1002      2        4
''')

df2 = io.StringIO('''\
ID1        ID2      Value   
1          1001    2
2          1002    3
''')

expected_result = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
2          1002      2        4
''')

df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))

assert all(df1.loc[df2.index] == expected_result) # won't work

If both dataframes have the same dimension, one can simply write:

df1.loc[df2.index]

which is equivalent to passing a list of index tuples of the same dimension, e.g.

df1.loc[[(1, 1001, 1), (1, 1001, 2)]]

It is also possible to select single elements based on a lower dimensional index like so:

df1.loc[(1, 1001)]

But how can I filter based on a list (or other index) with lower dimension?

It seems a bit tricky to get your desired result. As of pandas 0.19.2, the multi-index label locator loc seems to be buggy when supplied with an iterable of exactly defined rows:

# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# messes up with varying levels 
print(df1.loc[desired_rows, :])

              Value
ID1 ID2  ID3       
1   1001 2        2

# when reducing the index to the first two same levels, it works
print(df1.loc[desired_rows[:2], :])

              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2

Therefore, we can't rely on loc for your given example. In contrast, the positional indexer iloc still works as expected. However, it requires the corresponding integer positions, which can be obtained as shown below:

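# Index.get_values() was removed in pandas 1.0; iterating the index yields the same tuples
df2_indices = set(df2.index)
df2_levels = df2.index.nlevels

# positions of the df1 rows whose leading index levels appear in df2
indices = [idx for idx, index in enumerate(df1.index)
           if index[:df2_levels] in df2_indices]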

print(df1.iloc[indices, :])

              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Update 15.07.2017

An easier solution is to convert the desired_rows tuple of tuples to a list, because loc works more consistently with a list as the row locator:

df1.loc[list(desired_rows), :]

                        Value
ID1     ID2     ID3     
1       1001    1           1
                2           2
2       1002    2           4
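
The update above hard-codes desired_rows. As a rough sketch (not part of the original answer), the same list could be derived from df2 by keeping every full key of df1 whose leading levels occur in df2.index. The sample frames are rebuilt directly here instead of being parsed from text, and the names wanted_prefixes and n_levels are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the question (same values, constructed directly).
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

wanted_prefixes = set(df2.index)  # the (ID1, ID2) pairs to keep
n_levels = df2.index.nlevels      # number of leading levels to compare

# Keep every full key of df1 whose leading levels appear in df2.
desired_rows = [key for key in df1.index if key[:n_levels] in wanted_prefixes]

# Pass a list (not a tuple) of full keys to loc, as suggested in the update.
result = df1.loc[desired_rows, :]
print(result)
```

This keeps the label-based loc call from the update while avoiding a hand-written list of keys.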

You can do this by passing the individual index level values and then slice(None) for the non-existent 3rd index level:

In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)]

Out[107]:
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1001 1        3
         2        4

Then we can see that all values match:

In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)] == expected_result)

Out[111]:
True

The problem is that the indices do not have the same number of levels, so you need to specify what to pass for the non-existent 3rd level. Passing slice(None) selects all rows for that level, so the masking works.
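
One caveat worth checking before relying on this approach: list-like selectors given per level are matched independently for each level, so loc selects every combination of the listed ID1 and ID2 values rather than only the (ID1, ID2) pairs present in df2. The sketch below is an illustration, not the answerer's code. It uses the explicit tuple form df1.loc[(level0, level1, slice(None)), :] and rebuilds the question's sample data directly. On that data the selection also appears to pick up rows such as (2, 1001, 1), whose pair is not in df2:

```python
import pandas as pd

# Sample data rebuilt from the question (same values, constructed directly).
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

level0 = df2.index.get_level_values(0).tolist()  # ID1 values found in df2
level1 = df2.index.get_level_values(1).tolist()  # ID2 values found in df2

# Each level is filtered on its own, so all ID1/ID2 combinations are selected.
selected = df1.loc[(level0, level1, slice(None)), :]

# Rows of the selection whose (ID1, ID2) pair is NOT in df2.
unwanted = [key for key in selected.index if key[:2] not in set(df2.index)]
print(selected)
print(unwanted)
```

Note also that all(a == b) on two DataFrames iterates over column labels rather than cell values, so DataFrame.equals is likely a stricter way to compare a result with expected_result.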

One way is to temporarily reduce the dimension of the higher dimensional index and then do a same-dimension filtering:

 result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
           .reset_index().set_index(['ID1', 'ID2', 'ID3']))
 assert all(result == expected_result) # will pass

It's quite involved, though.
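
A slightly shorter variant of the same idea, offered as a sketch and not as the answerer's code, moves only the extra level out of the index and appends it back afterwards. It assumes the extra level is named ID3 and rebuilds the question's sample data directly:

```python
import pandas as pd

# Sample data rebuilt from the question (same values, constructed directly).
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

result = (
    df1.reset_index("ID3")             # index is now (ID1, ID2); ID3 becomes a column
       .loc[df2.index]                 # same-dimension label lookup
       .set_index("ID3", append=True)  # restore the third index level
)
print(result)
```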

The list comparison can be done with isin: first drop the additional levels from the index of the higher dimensional dataframe, then compare the remainder with the index of the lower dimensional one. In this case:

 mask = df1.index.droplevel(2).isin(df2.index)
 assert all(df1[mask] == expected_result) # passes
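
The isin approach generalizes if the levels to drop are looked up by name instead of by position. The following is a minimal sketch under the assumption that both indices have named levels and that every level of df2 also exists in df1. The variable extra_levels is illustrative, and the sample data is rebuilt directly:

```python
import pandas as pd

# Sample data rebuilt from the question (same values, constructed directly).
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)
expected_result = df1.iloc[[0, 1, 4]]  # the three rows listed in the question

# Levels that exist in df1's index but not in df2's index.
extra_levels = [name for name in df1.index.names if name not in df2.index.names]

# Boolean mask: True where the reduced key of df1 occurs in df2's index.
mask = df1.index.droplevel(extra_levels).isin(df2.index)

result = df1[mask]
print(result)
print(result.equals(expected_result))
```

Here DataFrame.equals is used for the check because it compares values and index labels, whereas all(df1[mask] == expected_result) only iterates over column labels.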
