
How to drop level 1 indices based on conditions for a MultiIndex DataFrame

My goal is to drop the two stocks/tickers (i.e. the bottom decile) with the lowest volume for each level 0 index Date, dropping the entire rows for those two level 1 tickers.

The DataFrame has already been sorted by volume, so for each date the rows are in ascending volume order. The DataFrame might look like this (shortened to 5 stocks instead of 20):

Date        Ticker  col1 col2 col3 Volume
2020-01-01  stock1   -    -    -    5
            stock2   -    -    -    10
            stock3   -    -    -    20
            stock4   -    -    -    40
            stock5   -    -    -    43
2020-02-01  stock3   -    -    -    7
            stock5   -    -    -    14
            stock1   -    -    -    33
            stock2   -    -    -    50
            stock4   -    -    -    52

For level 0 index Date "2020-01-01", I would want to drop stock1 and stock2, but for the next level 0 index Date "2020-02-01" I want to drop the new lowest 2, which are stock3 and stock5.

Note: the real DataFrame will be much bigger, with more than just 5 stocks and many more months.

So far I have tried adding a decile column using qcut (since my real goal is to do this for 20 stocks), which automatically identifies the lowest two values by volume, but I wasn't able to replicate that for EACH level 0 date (I was only successful for a single date and am not sure how to do it for every level 0 date).
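(For reference, a per-date version of that qcut decile column can be written with groupby/transform. The following is only a sketch using made-up stand-in data with 20 tickers per date, since 10 quantile bins need enough distinct values per group; the column and index names are assumed from the example above.)

import numpy as np
import pandas as pd

# Illustrative stand-in for the real data: 2 dates x 20 tickers with distinct
# volumes, sorted by Volume within each Date (as in the question) and indexed
# by (Date, Ticker)
rng = np.random.default_rng(0)
dates = ['2020-01-01', '2020-02-01']
tickers = [f'stock{i}' for i in range(1, 21)]
df_mi = (
    pd.DataFrame({
        'Date': np.repeat(dates, len(tickers)),
        'Ticker': tickers * len(dates),
        'Volume': rng.choice(np.arange(1, 1000), size=40, replace=False),
    })
    .sort_values(['Date', 'Volume'])
    .set_index(['Date', 'Ticker'])
)

# qcut applied per level 0 (Date) group via transform, so each date gets its
# own decile labels 0-9 based only on that date's volumes
decile = df_mi.groupby(level='Date')['Volume'].transform(
    lambda s: pd.qcut(s, 10, labels=False)
)

# Keep everything above the bottom decile, i.e. drop the 2 lowest-volume
# tickers of each date
filtered = df_mi[decile > 0]
print(filtered.groupby(level='Date').size())   # 18 tickers remain per date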

I also tried nsmallest and nlargest but encountered errors due to this being a DataFrame.

Do you have any suggestions as to how I can do this task? I feel as though I'm on the right path but I am missing something basic. Any insight is appreciated!

Since your DataFrame is already sorted by date and volume, you can drop the first 2 rows of each date group by adapting any of the answers to Python: Pandas - Delete the first row by group. For example:

import pandas as pd

# Create input data based on your example
d = {'Date': 5 * ['2020-01-01'] + 
             5 * ['2020-02-01'],
   'Ticker': ['stock1', 'stock2', 'stock3', 'stock4', 'stock5',
              'stock3', 'stock5', 'stock1', 'stock2', 'stock4'],
     'col1': 10 * ['-'],
   'Volume': [5, 10, 20, 40, 43, 7, 14, 33, 50, 52]}

df = pd.DataFrame(d)

# Get the first and second rows of each date group
to_del = df.groupby('Date', as_index=False).nth([0, 1])

# Intentionally duplicate the first and second rows of each date
# group, then remove them with drop_duplicates with keep=False to 
# drop *all* duplicated rows without keeping first occurrences
res = pd.concat([df, to_del]).drop_duplicates(keep=False)

print(res)

         Date  Ticker col1  Volume
2  2020-01-01  stock3    -      20
3  2020-01-01  stock4    -      40
4  2020-01-01  stock5    -      43
7  2020-02-01  stock1    -      33
8  2020-02-01  stock2    -      50
9  2020-02-01  stock4    -      52
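As a variation (not part of the approach above, just a sketch), if you keep the (Date, Ticker) MultiIndex from the question instead of the flat frame, the same "drop the first two rows of each date" idea can be expressed with cumcount, reusing the df built above:

# Sketch: same idea on a (Date, Ticker) MultiIndex version of the frame above.
# The rows are already ordered by ascending Volume within each date, so
# cumcount() numbers them 0, 1, 2, ... per date and positions 0 and 1 are the
# two lowest-volume tickers.
df_mi = df.set_index(['Date', 'Ticker'])
res_mi = df_mi[df_mi.groupby(level='Date').cumcount() >= 2]
print(res_mi)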
