简体   繁体   中英

How to extract subset of a dataframe that has the largest maximum row value within a subset of a subset?

I have a dataframe (df) that contains datapoints along rivers, where each river basin (Basin ID) contains multiple rivers (River ID), and each river has points along its length (Length). It looks something like this (simplified):

Basin_ID River_ID Length
1 1 5
1 1 10
1 2 5
1 2 7
1 2 12
1 3 5
2 1 5
2 1 10
2 1 12
2 1 14
2 2 5

In this example Basin 1 has three rivers, and Basin 2 has two rivers. The actual table has 600K rows and 12 other columns with other river data.

Objective: I want to extract the single longest river subset for each unique basin in df, so that I end up with something like this:

Basin_ID River_ID Length
1 2 5
1 2 7
1 2 12
2 1 5
2 1 10
2 1 12
2 1 14

I assume that I would need to do something like

res = df.groupby('Basin_ID').apply(lambda x: ...

But because I need to group both by Basin_ID and River_ID, I am finding it hard to come up with an expression that works as it should. Any advice welcome, thanks!

df2=df.assign(maxRiverLength=df.groupby('Basin_ID').transform(lambda x: x.max())['Length']).set_index(['Basin_ID','River_ID'])
df.set_index(['Basin_ID','River_ID']).loc[df2[df2['Length']==df2['maxRiverLength']].index].reset_index()

There might be a smarter way, but I reproduced your output in 2 steps:

  • I first create a new column containing the maxRiverLength per Basin_ID and assign this to a df2 where I set the index to Basin_ID and River_ID:

  • I then take the original df, set the index to Basin_ID and River_ID as well and filter by the indexes of df2 having 'Length' = 'maxRiverLength'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM