Filter multi-indexed dataframe based on lower dimensional values
I would like to filter a multi-indexed DataFrame based on another DataFrame with a lower dimensional index, as in the following example:
import io
import pandas as pd
df1 = io.StringIO('''\
ID1 ID2 ID3 Value
1 1001 1 1
1 1001 2 2
1 1002 1 9
2 1001 1 3
2 1002 2 4
''')
df2 = io.StringIO('''\
ID1 ID2 Value
1 1001 2
2 1002 3
''')
expected_result = io.StringIO('''\
ID1 ID2 ID3 Value
1 1001 1 1
1 1001 2 2
2 1002 2 4
''')
df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))
assert all(df1.loc[df2.index] == expected_result)  # won't work
If both dataframes have the same dimension, one can simply do:
df1.loc[df2.index]
which is equivalent to passing a list of indices of the same dimension, e.g.
df1.loc[[(1, 1001, 1), (1, 1001, 2)]]
It is also possible to select single elements based on a lower dimensional index like so:
df1.loc[(1, 1001)]
But how can I filter based on a list (or other index) with lower dimension?
It seems a bit tricky to get your desired result. As of pandas 0.19.2, the multi-index label locator loc seems to be buggy when supplying an iterable of exactly defined rows:
# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))
# messes up with varying levels
print(df1.loc[desired_rows, :])
              Value
ID1 ID2  ID3
1   1001 2        2
# when reducing the index to the first two same levels, it works
print(df1.loc[desired_rows[:2], :])
              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
Therefore, we can't rely on loc for your given example. In contrast, the position-based indexer iloc still works as expected. However, it requires you to get the corresponding index locations, as shown below:
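# Index.get_values() was removed in pandas 1.0; iterating the MultiIndex yields tuples
df2_indices = set(df2.index)
df2_levels = df2.index.nlevels
# keep the positions of df1 rows whose leading levels appear in df2's index
indices = [idx for idx, index in enumerate(df1.index)
           if index[:df2_levels] in df2_indices]
print(df1.iloc[indices, :])
              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
2   1002 2        4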
An easier solution is to convert the desired_rows tuple of tuples to a list, because loc treats a list of tuples as several complete row keys and therefore works more consistently with lists as a row locator:
df1.loc[list(desired_rows), :]
              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
2   1002 2        4
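For readers who want to try the list-based lookup on its own, here is a minimal sketch. It rebuilds the question's df1 from a dict rather than from io.StringIO (an assumption made only so the snippet runs standalone), and the name by_list is hypothetical.

```python
import pandas as pd

# Rebuild the question's df1 (same values as in the question).
df1 = pd.DataFrame(
    {'ID1': [1, 1, 1, 2, 2],
     'ID2': [1001, 1001, 1002, 1001, 1002],
     'ID3': [1, 2, 1, 1, 2],
     'Value': [1, 2, 9, 3, 4]}
).set_index(['ID1', 'ID2', 'ID3'])

desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# Passing a list of tuples asks loc for several complete row keys.
by_list = df1.loc[list(desired_rows), :]
print(by_list)
```

The explicit ", :" for the column axis is optional here, but it keeps the row key unambiguous.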
You can do this by passing the individual index level values and then slice(None) for the non-existent 3rd index level:
In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)]
Out[107]:
Value
ID1 ID2 ID3
1 1001 1 1
2 2
2 1001 1 3
2 4
Then we can see that all values match:
In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)] == expected_result)
Out[111]:
True
The problem is that, because the indices do not have the same dimensions, you need to specify what to pass for the non-existent 3rd level. Here, passing slice(None) selects all rows for that level, so the masking will work.
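One caveat may be worth checking before relying on this. When each level is given as its own list, loc appears to select every combination of the listed values rather than only the (ID1, ID2) pairs present in df2. The sketch below rebuilds the question's data from dicts (an assumption, so that it runs standalone) and passes the column axis explicitly. The names ids1, ids2 and combos are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'ID1': [1, 1, 1, 2, 2],
     'ID2': [1001, 1001, 1002, 1001, 1002],
     'ID3': [1, 2, 1, 1, 2],
     'Value': [1, 2, 9, 3, 4]}
).set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.DataFrame(
    {'ID1': [1, 2], 'ID2': [1001, 1002], 'Value': [2, 3]}
).set_index(['ID1', 'ID2'])

ids1 = df2.index.get_level_values(0).tolist()
ids2 = df2.index.get_level_values(1).tolist()

# Per-level lists: rows match if ID1 is in ids1 AND ID2 is in ids2,
# which also admits pairs such as (1, 1002) that are not in df2.
combos = df1.loc[(ids1, ids2, slice(None)), :]
print(combos)
```

With the df1 shown in the question, this selection also contains (1, 1002, 1) and (2, 1001, 1), so it matches the expected result only when the cross product of level values adds no extra pairs.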
One way is to temporarily reduce the dimension of the higher dimensional index and then do a same-dimension filtering:
result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
          .reset_index().set_index(['ID1', 'ID2', 'ID3']))
assert all(result == expected_result)  # will pass
It's quite involved though.
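A somewhat shorter variant of the same idea moves only the extra level out of the index and appends it back afterwards. This is a sketch under the assumption that the extra level is named ID3, with the question's data rebuilt from dicts so that it runs standalone.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'ID1': [1, 1, 1, 2, 2],
     'ID2': [1001, 1001, 1002, 1001, 1002],
     'ID3': [1, 2, 1, 1, 2],
     'Value': [1, 2, 9, 3, 4]}
).set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.DataFrame(
    {'ID1': [1, 2], 'ID2': [1001, 1002], 'Value': [2, 3]}
).set_index(['ID1', 'ID2'])

result = (df1.reset_index('ID3')                # index is now (ID1, ID2)
             .loc[df2.index]                    # same-dimension lookup
             .set_index('ID3', append=True))    # restore the third level
print(result)
```

As with the original version, the rows come back in the order of df2.index rather than in df1's order.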
The list comparison can be done with isin: first drop the additional dimensions from the index of the higher dimensional dataframe, and then compare what remains with the index of the lower dimensional one. In this case:
mask = df1.index.droplevel(2).isin(df2.index)
assert all(df1[mask] == expected_result) # passes
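The same mask can be built without hard-coding the level position by dropping every level of df1 whose name is missing from df2. This is only a sketch: it assumes the shared levels carry the same names and appear in the same order in both indexes, and the names extra and filtered are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'ID1': [1, 1, 1, 2, 2],
     'ID2': [1001, 1001, 1002, 1001, 1002],
     'ID3': [1, 2, 1, 1, 2],
     'Value': [1, 2, 9, 3, 4]}
).set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.DataFrame(
    {'ID1': [1, 2], 'ID2': [1001, 1002], 'Value': [2, 3]}
).set_index(['ID1', 'ID2'])

# Levels that exist only in the higher dimensional index.
extra = [name for name in df1.index.names if name not in df2.index.names]

# Boolean mask: True where the remaining (ID1, ID2) pair occurs in df2.
mask = df1.index.droplevel(extra).isin(df2.index)
filtered = df1[mask]
print(filtered)
```

Because this is a boolean mask over df1, the filtered rows keep df1's original order and its full three-level index.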