Filter multi-indexed dataframe based on lower dimensional values

I would like to filter a multi-indexed DataFrame based on another DataFrame with a lower dimensional index, like in the following example:

import io
import pandas as pd

df1 = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
1          1002      1        9
2          1001      1        3
2          1002      2        4
''')

df2 = io.StringIO('''\
ID1        ID2      Value   
1          1001    2
2          1002    3
''')

expected_result = io.StringIO('''\
ID1        ID2       ID3    Value   
1          1001      1        1
1          1001      2        2
2          1002      2        4
''')

df1 = pd.read_table(df1, sep=r'\s+').set_index(['ID1', 'ID2', 'ID3'])
df2 = pd.read_table(df2, sep=r'\s+').set_index(['ID1', 'ID2'])
expected_result = (pd.read_table(expected_result, sep=r'\s+')
                   .set_index(['ID1', 'ID2', 'ID3']))

assert all(df1.loc[df2.index] == expected_result) # won't work

If both DataFrames have the same number of index levels, one can simply write:

df1.loc[df2.index]

which is equivalent to passing a list of index tuples of the same dimension, e.g.

df1.loc[[(1, 1001, 1), (1, 1001, 2)]]

It is also possible to select single elements based on a lower dimensional index like so:

df1.loc[(1, 1001)]

But how can I filter based on a list (or other index) with lower dimension?

It is a bit tricky to get your desired result. As of pandas 0.19.2, the label-based locator loc seems to be buggy on a MultiIndex when it is given a tuple of fully specified row keys:

# this should give the correct result
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# messes up with varying levels 
print(df1.loc[desired_rows, :])

              Value
ID1 ID2  ID3
1   1001 2        2

# restricting the selection to the first two rows, which share ID1 and ID2, works
print(df1.loc[desired_rows[:2], :])

              Value
ID1 ID2  ID3
1   1001 1        1
         2        2

Therefore, we can't rely on loc for your example. In contrast, the integer-position indexer iloc still works as expected. However, it requires the integer positions of the matching rows, which can be computed as shown below:

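# (ID1, ID2) tuples to keep; iterating a MultiIndex yields tuples
# (Index.get_values() was removed in pandas 1.0)
df2_indices = set(df2.index)
df2_levels = df2.index.nlevels

# positions of the df1 rows whose leading index levels match an entry of df2
indices = [idx for idx, index in enumerate(df1.index)
           if index[:df2_levels] in df2_indices]

print(df1.iloc[indices, :])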

              Value
ID1 ID2  ID3
1   1001 1        1
         2        2
2   1002 2        4

Update 15.07.2017

An easier solution is to convert the desired_rows tuple to a list, because loc behaves more consistently when the row locator is a list:

df1.loc[list(desired_rows), :]

                        Value
ID1     ID2     ID3     
1       1001    1           1
                2           2
2       1002    2           4
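The list of full index tuples does not have to be typed by hand. The following is a minimal sketch, assuming the example frames from the question (rebuilt here directly instead of being parsed from text) and assuming that the levels of df2 are the leading levels of df1. The names wanted, n and selected are introduced only for illustration.

```python
import pandas as pd

# Example data from the question, rebuilt directly so the sketch is self-contained.
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

wanted = set(df2.index)   # the (ID1, ID2) pairs to keep
n = df2.index.nlevels     # number of leading levels to compare

# Full three-level keys of df1 whose leading part appears in df2.
desired_rows = [key for key in df1.index if key[:n] in wanted]

# Passing a list (not a tuple) of full keys as the row locator.
selected = df1.loc[desired_rows, :]
print(selected)
```

This loops over the index in Python, so it is probably only suitable for small frames.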

You can do this by passing the individual index level values and then slice(None) for the non-existent 3rd index level:

In [107]:
df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)]

Out[107]:
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1001 1        3
         2        4

Then we can see that all values match:

In [111]:
all(df1.loc[df2.index.get_level_values(0), df2.index.get_level_values(1),slice(None)] == expected_result)

Out[111]:
True

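The problem is that the indices do not have the same number of levels, so you need to specify what to pass for the missing third level; passing slice(None) selects all rows for that level. Note, however, that each level is matched independently here, so the selection can include (ID1, ID2) combinations that are not in df2, such as the row (2, 1001, 1) in the output above. Also, all() applied to a DataFrame iterates over its column labels, so the True above does not confirm that the values match.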
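It may be worth checking which (ID1, ID2) pairs a per-level selection actually returns. The sketch below assumes the example data from the question, rebuilt directly, and passes the level values as plain lists inside a single row-key tuple; the names per_level and extra_pairs are made up for this illustration.

```python
import pandas as pd

# Example data from the question, rebuilt directly so the sketch is self-contained.
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

id1 = df2.index.get_level_values("ID1").tolist()
id2 = df2.index.get_level_values("ID2").tolist()

# Each level is filtered on its own: ID1 in id1 AND ID2 in id2.
per_level = df1.loc[(id1, id2, slice(None)), :]

# Compare the selected (ID1, ID2) pairs with the pairs that df2 actually contains.
pairs_selected = set(per_level.index.droplevel("ID3"))
pairs_wanted = set(df2.index)
extra_pairs = pairs_selected - pairs_wanted
print(len(per_level), sorted(extra_pairs))
```

If extra_pairs is not empty, the selection is wider than df2 and a pairwise comparison of the index tuples is needed instead.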

One way is to temporarily reduce the dimension of the higher dimensional index to then do a same-dimension filtering:

result = (df1.reset_index().set_index(['ID1', 'ID2']).loc[df2.index]
          .reset_index().set_index(['ID1', 'ID2', 'ID3']))
# all() over a DataFrame only iterates its column labels, so compare with equals()
assert result.equals(expected_result) # will pass

It's quite involved though.
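A different technique that also avoids indexing with loc is an inner merge on the shared levels. This is a sketch, not the approach described above: it assumes the example data from the question (rebuilt directly), and the names keys and merged are hypothetical.

```python
import pandas as pd

# Example data from the question, rebuilt directly so the sketch is self-contained.
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

# Only the key columns of df2 are needed; duplicates would multiply rows in the merge.
keys = df2.index.to_frame(index=False).drop_duplicates()

# The inner merge keeps the df1 rows whose (ID1, ID2) pair occurs in keys.
merged = (df1.reset_index()
          .merge(keys, on=["ID1", "ID2"], how="inner")
          .set_index(["ID1", "ID2", "ID3"]))
print(merged)
```

The round trip through reset_index and set_index remains, so this is mainly useful when columns of df2 should be brought along as well.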

The list comparison can be done with isin: first drop the additional levels from the index of the higher dimensional DataFrame, then compare what remains with the index of the lower dimensional one. In this case:

mask = df1.index.droplevel(2).isin(df2.index)
# all() over a DataFrame only iterates its column labels, so compare with equals()
assert df1[mask].equals(expected_result) # passes
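The same idea can be wrapped in a small helper that finds the extra levels by name. This is a sketch under assumptions: filter_by_lower_index is a hypothetical name, all index levels are assumed to have unique, non-None names, and the level names of the smaller index are assumed to be a subset of those of the larger one. The example data from the question is rebuilt directly.

```python
import pandas as pd


def filter_by_lower_index(df, other):
    """Keep the rows of df whose index, restricted to other's levels, occurs in other.index.

    Hypothetical helper; assumes named levels and that other's level names
    are a subset of df's level names.
    """
    extra = [name for name in df.index.names if name not in other.index.names]
    reduced = df.index.droplevel(extra)
    if isinstance(reduced, pd.MultiIndex):
        # Put the remaining levels in the same order as in other.index.
        reduced = reduced.reorder_levels(list(other.index.names))
    return df[reduced.isin(other.index)]


# Example data from the question, rebuilt directly so the sketch is self-contained.
df1 = pd.DataFrame(
    {"Value": [1, 2, 9, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1001, 1), (1, 1001, 2), (1, 1002, 1), (2, 1001, 1), (2, 1002, 2)],
        names=["ID1", "ID2", "ID3"],
    ),
)
df2 = pd.DataFrame(
    {"Value": [2, 3]},
    index=pd.MultiIndex.from_tuples([(1, 1001), (2, 1002)], names=["ID1", "ID2"]),
)

result = filter_by_lower_index(df1, df2)
print(result)
```

Because the mask is a boolean array aligned with df1, the rows keep their original order and their full three-level index.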
