MultiIndex dataframes are very powerful but personally I think there is no enough (clear) documentations on it, specially for different type of slicing... Here is my question:
How to slice a multi-indexed dataframe just on one level with a list of labels? Please help me if you have a solution ( without reseting indexes and converting the dataframe to single level index! Which is obvious and not efficient )
For example, we have following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['id'] = pd.Series(range(10,20))
df['name'] = [f'name_{id}' for id in range(10,20)]
df['price'] = np.random.rand(df.index.size)
df['date'] = pd.date_range('20200310', '20200319')
df = df.set_index(['id', 'date'])
df
Slicing on one label is working just fine:
df.xs('2020-03-10', level='date', drop_level=False)
But how can we slice on a list of labels on that level?
df.xs(('2020-03-10', '2020-03-11', '2020-03-12'), level='date', drop_level=False)
This leads to an exception:
However Python doc says that "key" parameter could be a tuple as well:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
For filter by multiple values use Index.get_level_values
with Index.isin
and boolean indexing
:
a = df[df.index.get_level_values('date').isin(('2020-03-10', '2020-03-11', '2020-03-12'))]
print (a)
name price
id date
10 2020-03-10 name_10 0.557772
11 2020-03-11 name_11 0.122315
12 2020-03-12 name_12 0.775976
However Python doc says that "key" parameter could be a tuple as well:
Tuple is possible use, but working differently - you can select by both labels like:
b = df.xs((10, '2020-03-10'), drop_level=False)
print (b)
name name_10
price 0.348808
Name: (10, 2020-03-10 00:00:00), dtype: object
c = df.xs((10, '2020-03-10'), level=('id','date'), drop_level=False)
print (c)
name price
id date
10 2020-03-10 name_10 0.239876
Like @yatu mentioned, another solution with IndexSlice
is with :
for all first levels and last :
for all columns:
df = df.loc[pd.IndexSlice[:, ['2020-03-10', '2020-03-11', '2020-03-12']], :]
print (df)
name price
id date
10 2020-03-10 name_10 0.557488
11 2020-03-11 name_11 0.592082
12 2020-03-12 name_12 0.547747
The use of tuples when accessing multiindex is meant to address the different levels/hierarchy. Tuples are meant for this use, not as a form of passing multiple items within the same hierarchy/level. For multiple selections within the same level you need to use some other functions such as the one Jezrael .
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
filtered_df = df[df.index.get_level_values('date').isin(dates)]
This is a slight variation from the answer provided by @jezrael.
You can use loc()
combined with slice(None)
like this:
dates = ['2020-03-10', '2020-03-11', '2020-03-12']
df.loc[(slice(None), dates), :]
id date name price
10 2020-03-10 name_10 0.36806
11 2020-03-11 name_11 0.20436
12 2020-03-12 name_12 0.00443
The first argument in .loc
is a tuple that selects rows in the MultiIndex. slice(None)
gets all the values from the first level id
. The list dates
filters keys in the second level date
. The second argument :
selects all columns.
In the Pandas Documentation - MultiIndex - Advanced Indexing you can find:
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.