Counting number of occurrences of string matched to another column

Question

import pandas as pd

df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted',
              'hello there'],
      # 'number_of_stickers' is the column I am aiming to produce; I don't have it yet.
      'number_of_stickers': ['2', '0', '0', '1', '0', '0']}

df = pd.DataFrame(data=df)

Above is what I am trying to achieve. I currently do not have the column 'number_of_stickers'; producing it is my end goal.

I am trying to count each run of consecutive "sticker omitted" rows and record that count on the row just above the run, in a new column 'number_of_stickers'. All remaining rows get 0.

To give you some context, I am analysing WhatsApp text data, and I thought it would be useful to see how many stickers were sent right after a chat message. This also says something about the tone and sentiment of the conversation.

Update:

I have posted a solution (credits to @JacoSolari) that works for the problem I'm solving. I added one or two lines (an if statement) on top of his code so that we do not run into an out-of-range error at the end of the dataframe.

Answer 1

A common technique is to flag the rows that are not 'sticker omitted' and take the cumulative sum of that flag, which labels each message and the stickers that follow it as one block:

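import numpy as np

# block id: increases by 1 at every row that is not a sticker
omitted = df.msg.ne('sticker omitted').cumsum()

# first row of each block gets (block size - 1); the sticker rows get 0
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size')-1)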
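
To make the block idea easier to follow, here is a minimal sketch that keeps each intermediate step in its own variable, using the six-row sample from the question. The names is_text, block and block_size are illustrative and not part of the answer. One deliberate variation: the final step selects on the is_text mask instead of duplicated(), which I assume is preferable when a chat export starts with sticker rows, since those rows then stay at 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'msg': ['i am so happy thank you',
                           'sticker omitted',
                           'sticker omitted',
                           'thank you for your time!',
                           'sticker omitted',
                           'hello there']})

# True on ordinary messages, False on sticker rows
is_text = df['msg'].ne('sticker omitted')

# Cumulative sum of the mask: every message opens a new block,
# and the stickers after it inherit the same block id
block = is_text.cumsum()

# Number of rows in each block, repeated on every row of that block
block_size = block.groupby(block).transform('size')

# A message row gets (block size - 1) stickers; sticker rows get 0
df['number_of_stickers'] = np.where(is_text, block_size - 1, 0)

print(df.assign(block=block, block_size=block_size))
```

For the sample data the block ids come out as 1, 1, 1, 2, 2, 3, so the first message is followed by two stickers, the second by one, and the last by none.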

Answer 2

You've actually got it all right so far, and your data is enough for an easy yet functional algorithm!

Here is a little piece of code I wrote for this problem:

import pandas as pd

#ss
# Note: this reads 'number_of_stickers', so that column must already exist.
df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted'],
      'number_of_stickers': ['2', '0', '0', '1', '0']}
newarr = []  # collects [msg, count] pairs for rows with a non-zero count
for msg, n in zip(df['msg'], df['number_of_stickers']):
    if int(n) != 0:
        # newarr[i][0] is the message and newarr[i][1] is its sticker count
        newarr.append([msg, int(n)])
#se
#feel free to do whatever you want after ss to se

pd.DataFrame(data=df)

Here ss marks the snippet start and se the snippet end.

Hope this helps! Just comment below if it doesn't!

Also, you have to feed the new array back into the dict.
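
As a rough illustration of that last step, the pairs collected in newarr can be turned back into a table. This is only a sketch: the literal newarr below is what the loop above yields for the five-row sample, and the name summary is hypothetical.

```python
import pandas as pd

# [message, count] pairs as produced by the loop for the five-row sample
newarr = [['i am so happy thank you', 2],
          ['thank you for your time!', 1]]

# One row per message that was followed by at least one sticker
summary = pd.DataFrame(newarr, columns=['msg', 'number_of_stickers'])
print(summary)
```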

Answer 3

This code should do the job. I could not find a solution that uses only pandas functions (though one might exist). Anyway, I left some comments in the code to describe my approach.

import pandas as pd

# create data
df_dict = {'msg': ['i am so happy thank you',
                   'sticker omitted',
                   'sticker omitted',
                   'thank you for your time!',
                   'sticker omitted']}

df = pd.DataFrame(data=df_dict)

# build column for sticker counts after message
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows (assumes the default 0..n-1 index)
    count = 0
    # a sticker row keeps a count of 0
    # a non-sticker row counts the stickers that directly follow it
    if row['msg'] != 'sticker omitted':
        k = 1  # offset of the row being checked after the non-sticker row
        # stop at the end of the dataframe or at the first non-sticker row
        while index + k < len(df) and df.loc[index + k, 'msg'] == 'sticker omitted':
            count += 1
            k += 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)

Answer 4

I have edited @JacoSolari's code (with the help of a kind soul) to match the needs of the problem I'm trying to solve. I hope the code below is useful.

# assumes df is the dataframe from the question, with the default 0..n-1 index
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows
    count = 0
    # a sticker row keeps a count of 0
    # a non-sticker row counts the stickers that directly follow it
    if row['msg'] != 'sticker omitted':
        k = 1  # offset of the row being checked after the non-sticker row
        while True:
            msg_index = index + k
            # added check: stop before reading past the end of the dataframe
            if msg_index >= len(df):
                break
            if df.loc[msg_index].msg == 'sticker omitted':
                count += 1
                k += 1
            else:
                break
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
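
For longer chat exports, the same counts can be produced in a single pass by walking the messages from the bottom up and carrying a running sticker count. This is a sketch of an alternative, not the code from the answers above; the function name count_following_stickers and its marker parameter are hypothetical. Because it works on list positions, it does not depend on the dataframe index and needs no end-of-dataframe check.

```python
import pandas as pd

def count_following_stickers(messages, marker='sticker omitted'):
    """For each message, count the marker rows that directly follow it."""
    counts = [0] * len(messages)
    run = 0  # stickers seen since the last ordinary message (walking upward)
    for i in range(len(messages) - 1, -1, -1):
        if messages[i] == marker:
            run += 1
        else:
            counts[i] = run  # this message owns the run directly below it
            run = 0
    return counts

df = pd.DataFrame({'msg': ['i am so happy thank you',
                           'sticker omitted',
                           'sticker omitted',
                           'thank you for your time!',
                           'sticker omitted',
                           'hello there']})
df['sticker_counts'] = count_following_stickers(df['msg'].tolist())
print(df)
```

Sticker rows at the very top of the data, with no message above them, are simply left at 0.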

4 answers

solution1
1 2020-09-22 14:45:42

solution2
0 2020-09-22 14:14:43

solution3
0 ACCEPTED 2020-09-22 15:25:39

solution4
0 2020-09-23 01:25:52