Pandas dataframe split or groupby dataframe at each occurrence of value (True) in column

Question

I have a df like this:

import pandas as pd

df = pd.DataFrame({'words':['hi', 'this', 'is', 'a', 'sentence', 'this', 'is', 'another', 'sentence'], 'indicator':[1,0,0,0,0,1,0,0,0]})

which gives me:

      words  indicator
0        hi          1
1      this          0
2        is          0
3         a          0
4  sentence          0
5      this          1
6        is          0
7   another          0
8  sentence          0

Now I want to merge all values of column 'words' from each '1' in indicator up to (but not including) the next '1'. Something like this would be the ideal result:

                      words  indicator  counter
0     hi this is a sentence          1        5
1  this is another sentence          1        4

It's not that easy to explain, that's why I rely on this example. I tried groupby and split, but couldn't get to a solution. Last try would be to set up some kind of df.iterrows(), but I want to avoid this for now since the actual df is quite large.

Thanks in advance for any help!

Answer 1

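You can take the cumulative sum of your indicator column: every 1 starts a new sentence, so the running total gives each row a sentence id. Then groupby that id to join all the words together on a space and count the number of words in each sentence. Note that this overwrites indicator with the sentence id, so it reads 1, 2, ... in the result rather than 1 on every row.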

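# Each 1 starts a new sentence, so the running total is a sentence id (1, 2, ...)
df["indicator"] = df["indicator"].cumsum()
df = df.groupby(
    "indicator", as_index=False
).agg(
    words=("words", " ".join),      # join the words of each sentence with spaces
    counter=("indicator", "size")   # number of rows (words) in each sentence
)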
#    indicator                     words  counter
# 0          1     hi this is a sentence        5
# 1          2  this is another sentence        4
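
If you want the exact layout from the question (indicator staying 1, original df left untouched), a variation along these lines might work. This is only a sketch: the names group_id and result are my own, and it assumes the first row carries a 1. Any rows before the first 1 would end up in a group of their own with indicator 0.

```python
import pandas as pd

df = pd.DataFrame({
    "words": ["hi", "this", "is", "a", "sentence",
              "this", "is", "another", "sentence"],
    "indicator": [1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Running count of the 1s: every row gets the id of the sentence it belongs to.
# Renamed (hypothetical name) so the key does not clash with the "indicator" column.
group_id = df["indicator"].eq(1).cumsum().rename("group_id")

result = (
    df.groupby(group_id)
    .agg(
        words=("words", " ".join),         # join the words of each sentence
        indicator=("indicator", "first"),  # the leading 1 of each group
        counter=("words", "size"),         # number of words per sentence
    )
    .reset_index(drop=True)                # drop the group ids, index becomes 0, 1, ...
)
print(result)
```

Grouping by a separate Series instead of overwriting the column keeps df as it was, which can be handy if you still need the per-word rows afterwards.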

solution1
2 ACCEPTED 2021-07-08 12:37:21
