With pandas filter rows on sum of column

Question

I would like to select rows in a dataframe based on a sum criterion for one of the columns. For example, I want the indexes of the first rows of the dataframe where the sum of column B is less than 3:

import pandas as pd
df = pd.DataFrame({'A': ['z', 'y', 'x', 'w'], 'B': [1, 1, 1, 1]})

The only solution I have is a separate dataframe and a while loop:

df2 = pd.DataFrame({'A':[], 'B':[]})
index = 0
while df2['B'].sum() < 3:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    df2 = pd.concat([df2, df.loc[[index]]])
    index += 1

The logic gets me where I need to be but seems unnecessarily inefficient. Does anyone have a creative way of using pandas to filter the dataframe based on a conditional sum of a column?

Answer 1

What you describe is a cumulative sum (cumsum).

Appending rows to a DataFrame within a loop is horribly inefficient as it copies the entire DataFrame on every iteration just to append an additional small amount of data. Instead you should look to slice your original DataFrame with a Boolean mask; in this case checking where the cumsum is less than 3.

df2 = df[df['B'].cumsum().lt(3)]

#   A  B
#0  z  1
#1  y  1

df['B'].cumsum()
#0    1
#1    2
#2    3
#3    4

df['B'].cumsum().lt(3)
#0     True     <- Slicing with this Boolean Series
#1     True     <- keeps only these True rows
#2    False
#3    False
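
Since the question asks for the indexes rather than the rows themselves, here is a minimal self-contained sketch of the same cumsum mask that also pulls out the index labels. The names mask and selected_index are illustrative only, and the sketch assumes the values in B are non-negative, so the running total never drops back below the threshold once it has reached it.

```python
import pandas as pd

# Same toy data as in the question, with the labels in column A quoted as strings.
df = pd.DataFrame({'A': ['z', 'y', 'x', 'w'], 'B': [1, 1, 1, 1]})

# Running total of column B, compared against the threshold (3, as in the question).
mask = df['B'].cumsum().lt(3)

# Rows whose running total is still below the threshold.
df2 = df[mask]

# Index labels of those rows, which is what the question asked for.
selected_index = df2.index.tolist()
print(selected_index)  # [0, 1]
```

The whole selection is a single vectorized pass over column B, so no second DataFrame has to be grown row by row.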

solution1 2 ACCEPTED 2021-02-11 22:48:28
