I would like to filter a multi-indexed DataFrame based on another DataFrame with a lower-dimensional index, as in the following example:
import io
import pandas as pd
df1 = io.StringIO('''\
ID1 ID2 ID3 Value
1 1001 1 1
1 1001 2 2
1 1002 1 9
2 1001 1 3
2 1002 2 4
''')
df2 = io.StringIO('''\
ID1 ID2 Value
1 1001 2
2 1002 3
''')
expected_result = io.StringIO('''\
ID1 ID2 ID3 Value
1 1001 1 1
1 1001 2 2
2 1002 2 4
''')
df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))
assert all(df1.loc[df2.index] == expected_result)  # won't work
If both dataframes have the same dimension one can simply:
df1.loc[df2.index]
which is equivalent to passing a list of same-dimension index tuples, e.g.
df1.loc[[(1, 1001, 1), (1, 1001, 2)]]
It is also possible to select single elements based on a lower dimensional index like so:
df1.loc[(1, 1001)]
But how can I filter based on a list (or other index) with lower dimension?
It seems a bit tricky to get your desired result. As of pandas 0.19.2, the multi-index label locator loc
seems to be buggy when supplied a tuple of fully specified row keys:
# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))
# messes up with varying levels
print(df1.loc[desired_rows, :])
Value
ID1 ID2 ID3
1 1001 2 2
# with only the first two tuples, which share the same leading levels, it works
print(df1.loc[desired_rows[:2], :])
Value
ID1 ID2 ID3
1 1001 1 1
2 2
Therefore, we can't rely on loc
for your given example. In contrast, the integer locator iloc
still works as expected. However, it requires the corresponding integer row positions, which can be computed as shown below:
df2_indices = set(df2.index.get_values())
df2_levels = len(df2.index.levels)
indices = [idx for idx, index in enumerate(df1.index)
if index[:df2_levels] in df2_indices]
print(df1.iloc[indices, :])
Value
ID1 ID2 ID3
1 1001 1 1
2 2
2 1002 2 4
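Put together as a stand-alone sketch, the iloc approach looks as follows (the frames are rebuilt inline so the snippet runs on its own; `Index.nlevels` is used here as a shorter equivalent of `len(index.levels)`):

```python
import pandas as pd

# Rebuild the example frames so the snippet runs stand-alone.
df1 = pd.DataFrame({"ID1": [1, 1, 1, 2, 2],
                    "ID2": [1001, 1001, 1002, 1001, 1002],
                    "ID3": [1, 2, 1, 1, 2],
                    "Value": [1, 2, 9, 3, 4]}).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame({"ID1": [1, 2],
                    "ID2": [1001, 1002],
                    "Value": [2, 3]}).set_index(["ID1", "ID2"])

# Collect the positional indices of df1 rows whose leading index levels
# appear as a key in df2's index.
df2_keys = set(df2.index)          # {(1, 1001), (2, 1002)}
n_levels = df2.index.nlevels       # 2
positions = [i for i, idx in enumerate(df1.index)
             if idx[:n_levels] in df2_keys]
result = df1.iloc[positions]
print(result["Value"].tolist())    # [1, 2, 4]
```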
An easier solution is to convert the desired_rows
tuple of tuples into a list, because loc
behaves more consistently when given a list as a row locator:
df1.loc[list(desired_rows), :]
Value
ID1 ID2 ID3
1 1001 1 1
2 2
2 1002 2 4
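If the desired rows should come from df2 rather than being typed by hand, the list of full index tuples can be derived from the two indices and passed straight to loc; a sketch using the example frames:

```python
import pandas as pd

# Rebuild the example frames so the snippet runs stand-alone.
df1 = pd.DataFrame({"ID1": [1, 1, 1, 2, 2],
                    "ID2": [1001, 1001, 1002, 1001, 1002],
                    "ID3": [1, 2, 1, 1, 2],
                    "Value": [1, 2, 9, 3, 4]}).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame({"ID1": [1, 2],
                    "ID2": [1001, 1002],
                    "Value": [2, 3]}).set_index(["ID1", "ID2"])

# Keep the full 3-level tuples whose first two levels occur in df2.
df2_keys = set(df2.index)
desired_rows = [idx for idx in df1.index if idx[:2] in df2_keys]
result = df1.loc[desired_rows]     # a list (not a tuple) of index tuples
print(result["Value"].tolist())    # [1, 2, 4]
```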
You can do this by passing the individual index level values and then slice(None)
for the non-existent 3rd index level:
In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1), slice(None)]
Out[107]:
Value
ID1 ID2 ID3
1 1001 1 1
2 2
2 1001 1 3
2 4
Then we can see that all values match:
In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1), slice(None)] == expected_result)
Out[111]:
True
The problem is that, because the indices do not have the same number of levels, you need to specify what to pass for the non-existent 3rd level; passing slice(None)
selects all rows for that level, so the masking works
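A caveat with passing the level values separately: each list acts per level, so the selection is effectively a cross product of the supplied values and can pick up level combinations that are not actually present in df2. With recent pandas versions this is visible in the example data, sketched here with the equivalent pd.IndexSlice spelling:

```python
import pandas as pd

# Rebuild the example frames so the snippet runs stand-alone.
df1 = pd.DataFrame({"ID1": [1, 1, 1, 2, 2],
                    "ID2": [1001, 1001, 1002, 1001, 1002],
                    "ID3": [1, 2, 1, 1, 2],
                    "Value": [1, 2, 9, 3, 4]}).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame({"ID1": [1, 2],
                    "ID2": [1001, 1002],
                    "Value": [2, 3]}).set_index(["ID1", "ID2"])

idx = pd.IndexSlice
result = df1.loc[idx[df2.index.get_level_values(0),
                     df2.index.get_level_values(1), :], :]
# Any ID1 in {1, 2} combines with any ID2 in {1001, 1002}, so rows such as
# (1, 1002, 1) and (2, 1001, 1) are selected even though those (ID1, ID2)
# pairs are not keys of df2 -- here every row of df1 matches.
print(sorted(result["Value"].tolist()))
```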
One way is to temporarily reduce the higher-dimensional index to the lower dimension, do a same-dimension filter, and then restore the full index:
result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
.reset_index().set_index(['ID1', 'ID2', 'ID3']))
assert all(result == expected_result) # will pass
It's quite involved though.
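As a stand-alone sketch of that round trip, with the example frames rebuilt inline:

```python
import pandas as pd

# Rebuild the example frames so the snippet runs stand-alone.
df1 = pd.DataFrame({"ID1": [1, 1, 1, 2, 2],
                    "ID2": [1001, 1001, 1002, 1001, 1002],
                    "ID3": [1, 2, 1, 1, 2],
                    "Value": [1, 2, 9, 3, 4]}).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame({"ID1": [1, 2],
                    "ID2": [1001, 1002],
                    "Value": [2, 3]}).set_index(["ID1", "ID2"])

result = (df1.reset_index()
             .set_index(["ID1", "ID2"])            # drop to df2's dimension
             .loc[df2.index]                       # same-dimension lookup
             .reset_index()
             .set_index(["ID1", "ID2", "ID3"]))    # restore the full index
print(result["Value"].tolist())                    # [1, 2, 4]
```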
The comparison can be done with isin
: first drop the additional levels from the index of the higher-dimensional DataFrame, then compare what remains with the index of the lower-dimensional one. In this case:
mask = df1.index.droplevel(2).isin(df2.index)
assert all(df1[mask] == expected_result) # passes
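A stand-alone version of the isin approach; note that DataFrame.equals is a stricter check here than all(... == ...), since iterating a DataFrame (which is what all() does) yields its column labels, which are truthy strings, rather than its values:

```python
import pandas as pd

# Rebuild the example frames so the snippet runs stand-alone.
df1 = pd.DataFrame({"ID1": [1, 1, 1, 2, 2],
                    "ID2": [1001, 1001, 1002, 1001, 1002],
                    "ID3": [1, 2, 1, 1, 2],
                    "Value": [1, 2, 9, 3, 4]}).set_index(["ID1", "ID2", "ID3"])
df2 = pd.DataFrame({"ID1": [1, 2],
                    "ID2": [1001, 1002],
                    "Value": [2, 3]}).set_index(["ID1", "ID2"])

# Drop ID3 from df1's index and test the remaining (ID1, ID2) pairs
# for membership in df2's index.
mask = df1.index.droplevel(2).isin(df2.index)
result = df1[mask]

# Element-wise verification against the expected rows.
expected = df1.loc[[(1, 1001, 1), (1, 1001, 2), (2, 1002, 2)]]
assert result.equals(expected)
```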