Filter rows in pandas based on the sum of a column

I would like to select rows in a dataframe based on a sum criterion on one of the columns. For example, I want the indexes of the first rows of the dataframe where the sum of column B is less than 3:

import pandas as pd
df = pd.DataFrame({'A': ['z', 'y', 'x', 'w'], 'B': [1, 1, 1, 1]})

The only solution I have is a separate dataframe and a while loop:

df2 = pd.DataFrame({'A': [], 'B': []})
index = 0
while df2['B'].sum() < 3:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df2 = pd.concat([df2, df.loc[[index]]])
    index += 1

This logic gets me what I need, but it seems unnecessarily inefficient. Does anyone have a creative way of using pandas to filter the dataframe based on a condition on the sum of a column?

What you describe is a cumulative sum (cumsum).

Appending rows to a DataFrame within a loop is horribly inefficient as it copies the entire DataFrame on every iteration just to append an additional small amount of data. Instead you should look to slice your original DataFrame with a Boolean mask; in this case checking where the cumsum is less than 3.

df2 = df[df['B'].cumsum().lt(3)]

#   A  B
#0  z  1
#1  y  1

df['B'].cumsum()
#0    1
#1    2
#2    3
#3    4

df['B'].cumsum().lt(3)
#0     True     <- Slicing with this Boolean Series
#1     True     <- keeps only these True rows
#2    False
#3    False
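
Since the question asks for the indexes of the selected rows, here is a minimal self-contained sketch of getting them from the same mask. The helper name rows_while_cumsum_below is hypothetical and only for illustration. It also assumes the values in B are non-negative, so the cumulative sum never decreases and the True rows form one contiguous block at the top of the frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['z', 'y', 'x', 'w'], 'B': [1, 1, 1, 1]})

# Boolean mask: True while the running total of B is still below 3
mask = df['B'].cumsum().lt(3)

# Index labels of the rows that pass the mask
selected_index = df.index[mask].tolist()
print(selected_index)  # [0, 1]


def rows_while_cumsum_below(frame, column, limit):
    """Hypothetical helper: keep rows whose running total of `column` is below `limit`.

    Assumes non-negative values, so the kept rows are the leading rows.
    """
    return frame[frame[column].cumsum().lt(limit)]


print(rows_while_cumsum_below(df, 'B', 3))
```

Note that the while loop in the question keeps appending until the sum reaches 3, so it ends up with three rows. If that is the behaviour you want, .le(3) in place of .lt(3) keeps the row that brings the total to exactly 3.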
