Filter a multi-indexed DataFrame based on lower-dimensional index values

Question

I would like to filter a multi-indexed DataFrame based on another DataFrame with a lower-dimensional index, as in the following example:

import io
import pandas as pd

df1 = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
1          1002      1        9
2          1001      1        3
2          1002      2        4
''')

df2 = io.StringIO('''\
ID1        ID2      Value   
1          1001    2
2          1002    3
''')

expected_result = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
2          1002      2        4
''')

df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))

assert all(df1.loc[df2.index] == expected_result) # won't work

If both DataFrames have the same dimension, one can simply do:

df1.loc[df2.index]

which is equivalent to passing a list of same-dimension indices, e.g.

df1.loc[[(1, 1001, 1), (1, 1001, 2)]]

It is also possible to select single elements based on a lower-dimensional index like so:

df1.loc[(1, 1001)]

But how can I filter based on a list (or other index) with lower dimension?

Answer 1

Getting your desired result seems a bit tricky. As of pandas 0.19.2, the multi-index label locator loc seems to be buggy when given an iterable of fully specified rows:

# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# messes up with varying levels 
print(df1.loc[desired_rows, :])

                        Value
ID1     ID2     ID3     
1       1001    2           2

# when reducing the index to the first two same levels, it works
print(df1.loc[desired_rows[:2], :])

                        Value
ID1     ID2     ID3     
1       1001    1           1
                2           2

Therefore, we can't rely on loc for your given example. In contrast, the integer-position indexer iloc still works as expected. However, it requires you to compute the corresponding index positions first, as shown below:

df2_indices = set(df2.index)
df2_levels = df2.index.nlevels

indices = [idx for idx, index in enumerate(df1.index) 
           if index[:df2_levels] in df2_indices]

print(df1.iloc[indices, :])

                        Value
ID1     ID2     ID3     
1       1001    1           1
                2           2
2       1002    2           4

Update 15.07.2017

An easier solution is to convert the desired_rows tuple to a list, because loc behaves more consistently with a list as the row locator:

df1.loc[list(desired_rows), :]

                        Value
ID1     ID2     ID3     
1       1001    1           1
                2           2
2       1002    2           4
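In the snippets above, desired_rows is written out by hand. As a minimal sketch (not part of the original answer), the full three-level keys could instead be derived from the index of df2 and then passed to loc as a list. The frames are rebuilt inline so the block runs on its own; the names wanted_pairs and n_levels are illustrative, and a reasonably recent pandas is assumed.

```python
import pandas as pd

# Rebuild the frames from the question so the sketch is self-contained.
df1 = pd.DataFrame({
    "ID1": [1, 1, 1, 2, 2],
    "ID2": [1001, 1001, 1002, 1001, 1002],
    "ID3": [1, 2, 1, 1, 2],
    "Value": [1, 2, 9, 3, 4],
}).set_index(["ID1", "ID2", "ID3"])

df2 = pd.DataFrame({
    "ID1": [1, 2],
    "ID2": [1001, 1002],
    "Value": [2, 3],
}).set_index(["ID1", "ID2"])

# The (ID1, ID2) pairs to keep, and how many leading levels they cover.
wanted_pairs = set(df2.index)
n_levels = df2.index.nlevels

# Keep every full key of df1 whose leading levels form a wanted pair.
desired_rows = [key for key in df1.index if key[:n_levels] in wanted_pairs]

# Pass a list (not a tuple of tuples) as the row locator.
result = df1.loc[desired_rows, :]
print(result)
```

This keeps the list-based loc lookup from the update but removes the need to enumerate the rows manually.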

Answer 2

You can do this by passing the individual index level values and then slice(None) for the non-existent 3rd index level:

In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)]

Out[107]:
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1001 1        3
         2        4

Then we can see that all values match:

In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)] == expected_result)

Out[111]:
True

The problem is that, because the indices do not have the same dimensions, you need to specify what to pass for the non-existent 3rd level. Passing slice(None) here selects all rows for that level, so the masking will work.
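One caveat may be worth checking before relying on this approach: when a separate list of values is passed for each level, pandas matches each level independently, so the lists combine as a cross product rather than as (ID1, ID2) pairs. The sketch below is an assumption-laden check, not part of the original answer. It rebuilds the question's frames, uses the tuple form of the indexer, assumes a recent pandas version, and collects any rows whose (ID1, ID2) pair is not actually in df2. The names per_level and stray are illustrative.

```python
import pandas as pd

# Rebuild the frames from the question so the sketch is self-contained.
df1 = pd.DataFrame({
    "ID1": [1, 1, 1, 2, 2],
    "ID2": [1001, 1001, 1002, 1001, 1002],
    "ID3": [1, 2, 1, 1, 2],
    "Value": [1, 2, 9, 3, 4],
}).set_index(["ID1", "ID2", "ID3"]).sort_index()

df2 = pd.DataFrame({
    "ID1": [1, 2],
    "ID2": [1001, 1002],
    "Value": [2, 3],
}).set_index(["ID1", "ID2"])

# One list of labels per level, plus slice(None) for the missing level.
id1_values = list(df2.index.get_level_values("ID1"))
id2_values = list(df2.index.get_level_values("ID2"))
per_level = df1.loc[(id1_values, id2_values, slice(None)), :]

# Rows that were selected although their (ID1, ID2) pair is not in df2.
pairs = set(df2.index)
stray = [key for key in per_level.index if key[:2] not in pairs]
print(per_level)
print(stray)
```

If stray is non-empty for your data, a pair-wise comparison (for example with isin on the reduced index) is the safer choice.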

Answer 3

One way is to temporarily reduce the dimension of the higher-dimensional index and then do a same-dimension filtering:

 result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
           .reset_index().set_index(['ID1', 'ID2', 'ID3']))
 assert all(result == expected_result) # will pass

It's quite involved though.

Answer 4

The list comparison can be done with isin: first drop the additional levels from the index of the higher-dimensional DataFrame, then compare what remains with the index of the lower-dimensional one. In this case:

 mask = df1.index.droplevel(2).isin(df2.index)
 assert all(df1[mask] == expected_result) # passes
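The same idea can be wrapped so that the levels to drop are worked out from the level names instead of being hard-coded as position 2. This is a minimal sketch under some assumptions: filter_by_lower_index is a hypothetical helper, both indexes are assumed to have named levels, the shared levels are assumed to appear in the same order in both indexes, and the key index is assumed to have at least two levels. The frames from the question are rebuilt inline so the block runs on its own.

```python
import pandas as pd

# Rebuild the frames from the question so the sketch is self-contained.
df1 = pd.DataFrame({
    "ID1": [1, 1, 1, 2, 2],
    "ID2": [1001, 1001, 1002, 1001, 1002],
    "ID3": [1, 2, 1, 1, 2],
    "Value": [1, 2, 9, 3, 4],
}).set_index(["ID1", "ID2", "ID3"])

df2 = pd.DataFrame({
    "ID1": [1, 2],
    "ID2": [1001, 1002],
    "Value": [2, 3],
}).set_index(["ID1", "ID2"])


def filter_by_lower_index(df, keys):
    """Hypothetical helper: keep the rows of df whose index, restricted
    to the levels named in keys, occurs in keys."""
    # Levels present in df but absent from the lower-dimensional index.
    extra = [name for name in df.index.names if name not in keys.names]
    # Drop them, then compare the remaining tuples pair-wise with isin.
    mask = df.index.droplevel(extra).isin(keys)
    return df[mask]


result = filter_by_lower_index(df1, df2.index)
print(result)
```

Note that all(df1[mask] == expected_result) iterates over the column labels of the comparison frame rather than its values, so a stricter check would be something like (df1[mask] == expected_result).all().all() or pandas.testing.assert_frame_equal.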

4 answers

Answer 1: 3 votes, 2017-03-02 16:41:56
Answer 2: 1 vote, 2017-03-02 16:38:51
Answer 3: 0 votes, 2017-03-02 16:29:28
Answer 4: 0 votes, accepted, 2017-03-28 14:54:49
