
Counting number of occurrences of string matched to another column

import pandas as pd

df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted',
              'hello there'],
      # 'number_of_stickers' is the column I am aiming to produce; I do not have it yet.
      'number_of_stickers': ['2', '0', '0', '1', '0', '0']}

df = pd.DataFrame(data=df)

Above is what I am trying to achieve. I currently do not have the column 'number_of_stickers'; producing it is my end goal.

I am trying to count each run of consecutive "sticker omitted" rows and record that count on the row directly above the run, in a new column 'number_of_stickers'. As in the example, the sticker rows themselves and any message not followed by a sticker should get 0.

To give you some context, I am analysing WhatsApp text data, and I thought it would be useful to see how many stickers were sent right after a chat message. This also gives a hint of the tone and sentiment of the conversation.

Update:

I have posted a solution (credits to @JacoSolari) that works for the problem I'm solving. I added one or two lines (an if statement) on top of his code so that it does not fail at the end of the dataframe (an out-of-range lookup).

It's a common technique to check for the other values and take cumsum to identify the blocks:

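import numpy as np

# block id: increases by 1 at every row that is not a sticker
omitted = df.msg.ne('sticker omitted').cumsum()

# the first row of each block gets (block size - 1); the repeated ids (stickers) get 0
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size')-1)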
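
To see why the cumsum trick works, it can help to look at the intermediate block ids. The sketch below rebuilds the question's data and wraps the same idea in a helper. The function name `count_following_stickers` is hypothetical, and masking on the message rows (instead of `duplicated()`) is my own variation, added so that sticker rows at the very start of a chat, which have no message above them, also get 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'msg': ['i am so happy thank you',
                           'sticker omitted',
                           'sticker omitted',
                           'thank you for your time!',
                           'sticker omitted',
                           'hello there']})


def count_following_stickers(msg, marker='sticker omitted'):
    """Hypothetical helper: number of stickers sent right after each message."""
    is_msg = msg.ne(marker)   # True on real messages, False on stickers
    block = is_msg.cumsum()   # one id per message plus its trailing stickers
    size = block.groupby(block).transform('size')  # rows in each block
    # Message rows get (block size - 1); sticker rows get 0.
    counts = np.where(is_msg, size - 1, 0)
    return pd.Series(counts, index=msg.index)


block_ids = df['msg'].ne('sticker omitted').cumsum()
print(block_ids.tolist())   # [1, 1, 1, 2, 2, 3]

df['number_of_stickers'] = count_following_stickers(df['msg'])
print(df)
```

Each non-sticker row starts a new block, so the block size minus one is exactly the length of the sticker run that follows it.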

You've actually got it all right so far, and your data is substantial enough for an easy yet functional algorithm!

Here is a little piece of code I wrote for this problem:

import pandas as pd

#ss
df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted'],
      'number_of_stickers': ['2', '0', '0', '1', '0']}
newarr = []  # collects [msg, count] pairs for rows with a non-zero count
for msg, n in zip(df["msg"], df["number_of_stickers"]):
    if int(n) != 0:
        # for each entry, element 0 is the message and element 1 is the count
        newarr.append([msg, int(n)])
#se
#feel free to do whatever you want after ss to se

pd.DataFrame(data=df)

Here ss marks the snippet start and se the snippet end.

Hope this helps! Just comment below if it doesn't!

Also, you have to feed the new array back into the dict.
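
One possible reading of that last remark: once a list of counts exists, store it under a key of the dict before building the DataFrame. This is only a sketch. The `counts` values are copied from the question's expected output purely for illustration, and the names `data`, `counts` and `pairs` are assumptions of mine.

```python
import pandas as pd

data = {'msg': ['i am so happy thank you',
                'sticker omitted',
                'sticker omitted',
                'thank you for your time!',
                'sticker omitted']}

# Assumed for illustration: counts copied from the question's expected output.
counts = [2, 0, 0, 1, 0]

# Store the list under a new key, then build the DataFrame from the dict.
data['number_of_stickers'] = counts
df = pd.DataFrame(data=data)

# [msg, count] pairs for the rows that are followed by at least one sticker
pairs = [[m, n] for m, n in zip(data['msg'], counts) if n != 0]
print(pairs)
```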

This code should do the job. I could not find a solution that uses only pandas functions (though it might be possible). Anyway, I left some comments in the code to describe my approach.

# create data
import pandas as pd

df_dict = {'msg': ['i am so happy thank you',
                   'sticker omitted',
                   'sticker omitted',
                   'thank you for your time!',
                   'sticker omitted']}

df = pd.DataFrame(data=df_dict)

# build column for sticker counts after message
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows (default 0..n-1 index assumed)
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # offset of the row being checked after the non-sticker row
        # keep going while there is a next row and it is a sticker;
        # the bounds check comes first so the last row never triggers a KeyError
        while index + k < len(df) and df.loc[index + k, 'msg'] == 'sticker omitted':
            count += 1
            k += 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)


I have edited @JacoSolari's code (with the help of a kind soul) to match the needs of the problem I'm trying to solve. I hope the code below is useful.

sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # to check rows after the non-sticker row
        while True:
            msg_index = index + k
            # when the end of the dataframe is reached, stop looking
            if msg_index >= len(df):
                break
            # stop at the first row that is not a sticker
            if df.loc[msg_index].msg != 'sticker omitted':
                break
            # the row at index + k is a sticker: count it and move on
            count += 1
            k += 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
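
The loop above looks rows up by label with `df.loc`, so it relies on the default 0..n-1 index. As a minimal sketch of the same counting logic, the version below works on a plain list in a single pass, which sidesteps both index labels and the end-of-data check. The function name `sticker_counts_after` and the non-default index in the usage part are assumptions made for illustration.

```python
import pandas as pd


def sticker_counts_after(messages, marker='sticker omitted'):
    """Hypothetical helper: for each row, how many stickers directly follow it."""
    counts = [0] * len(messages)
    last_msg_pos = None  # position of the most recent non-sticker row
    for pos, text in enumerate(messages):
        if text != marker:
            last_msg_pos = pos            # a new message starts a new run
        elif last_msg_pos is not None:
            counts[last_msg_pos] += 1     # credit the sticker to that message
    return counts


# Usage with a non-default index, e.g. after filtering a larger chat export.
df = pd.DataFrame({'msg': ['i am so happy thank you',
                           'sticker omitted',
                           'sticker omitted',
                           'thank you for your time!',
                           'sticker omitted',
                           'hello there']},
                  index=[10, 11, 12, 13, 14, 15])
df['sticker_counts'] = sticker_counts_after(df['msg'].tolist())
print(df)
```

Stickers that appear before any message are ignored here, since there is no row above them to attach the count to.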
