Conditional merging of dataframe rows

I have a 2xN dataframe of chat messages, and I am trying to find the cleanest way to merge consecutive messages that originate from the same speaker. Here is a sample of the data I am working with:

import pandas as pd

mydata = pd.DataFrame(data=[['A','random text'],
                            ['B','random text'],
                            ['A','random text'],
                            ['A','random text'],
                            ['A','random text'],
                            ['B','random text'],
                            ['A','random text'],
                            ['B','random text'],
                            ['B','random text'],
                            ['A','random text']], columns=['speaker','message'])

Hopefully you can see that the order of speakers is not in an ABAB format as I would like. Instead, there are some sequences of AAAB and ABBA. My current thinking is to rebuild the dataframe from scratch, checking the ID of each row with the ID of the next index position...

mergeCheck = True
while mergeCheck is True:
    # set length of the dataframe
    lenDF = len(mydata)
    # empty list to rebuild dataframe
    mergeDF = []
    # set index position at the beginning of dataframe
    i = 0
    while i < lenDF-1:
        # check whether adjacent rows have different ID
        if mydata['speaker'].iloc[i] != mydata['speaker'].iloc[i+1]:
            # if true, append row as is to mergeDF list
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i]])
            # increase index position by 1
            i += 1
        else:
            # merge messages
            mergeDF.append([mydata['speaker'].iloc[i],
                            mydata['message'].iloc[i] + mydata['message'].iloc[i+1]])
            # increase index position by 2
            i += 2
    # check whether the index position falls on the last message
    if i == lenDF-1:
        # if true, append row as is to mergeDF list
        mergeDF.append([mydata['speaker'].iloc[i],
                        mydata['message'].iloc[i]])
        # increase counter by 1
        i += 1
    # exit the outer loop once every row has been processed
    if i == lenDF:
        mergeCheck = False

However, this only works for two adjacent messages. Returning to my original data, when put into a dataframe, the above function generates the following output...

--------------------------
  speaker  |   message
--------------------------
    A         'random text'
    B         'random text'
    A         'random textrandom text'
    A         'random text'
    B         'random text'
    A         'random text'
    B         'random textrandom text'
    A         'random text'
--------------------------

I have thought about extending the function to make more comparisons against i (i.e. does .iloc[i] == .iloc[i+2], or .iloc[i] == .iloc[i+3], etc.), but this gets unworkable really quickly. What I think I need is some way to repeat the above function until the dataframe is in the desired format, but I'm unsure how to go about this.

A possible solution is this:

df1 = mydata[mydata['speaker']=='A'].reset_index()
df2= mydata[mydata['speaker']=='B'].reset_index()
df = pd.concat([df1, df2]).sort_index()

which returns

  index speaker      message
0      0       A  random text
0      1       B  random text
1      2       A  random text
1      5       B  random text
2      3       A  random text
2      7       B  random text
3      4       A  random text
3      8       B  random text
4      6       A  random text
5      9       A  random text

If you have a timestamp for these messages, remember to sort by time/date before resetting the index. Also, take care that the time order is preserved when concatenating.

After your clarification in the comments, I suggest this instead. First create a key that is shared by consecutive rows from the same speaker (A or B), and then group by that key together with the speaker.

df = mydata.copy()  # start again from the original data, not the interleaved frame above
df['key'] = (df['speaker'] != df['speaker'].shift(1)).astype(int).cumsum()

which gives

  speaker      message  key
0       A  random text    1
1       B  random text    2
2       A  random text    3
3       A  random text    3
4       A  random text    3
5       B  random text    4
6       A  random text    5
7       B  random text    6
8       B  random text    6
9       A  random text    7
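
To see why the key works, it can help to look at the intermediate columns one at a time. The following is only a minimal sketch on a shortened sample with the same shape as the question's data; the names demo, prev and changed are illustrative and not part of the original code.

```python
import pandas as pd

# shortened sample in the same shape as the question's data
demo = pd.DataFrame({'speaker': ['A', 'B', 'A', 'A', 'B', 'B'],
                     'message': ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']})

# speaker of the previous row (missing for the first row)
demo['prev'] = demo['speaker'].shift(1)
# True wherever the speaker differs from the previous row
demo['changed'] = demo['speaker'] != demo['prev']
# running count of changes: rows in the same run share one number
demo['key'] = demo['changed'].cumsum()

print(demo)
```

Rows 2 and 3 (both A) end up with the same key, as do rows 4 and 5 (both B), which is what lets groupby treat each run as one unit.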

Now, you simply need to groupby

df = df.groupby(['key', 'speaker'])['message'].apply(' '.join)
df

which gives

key  speaker
1    A                                  random text
2    B                                  random text
3    A          random text random text random text
4    B                                  random text
5    A                                  random text
6    B                      random text random text
7    A                                  random text
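
The result above is a Series with a two-level index. If a flat dataframe with speaker and message columns is preferred, one option (a sketch, certainly not the only way) is to reset the index and drop the helper key. The sample below uses distinct, made-up messages so that the join order is visible.

```python
import pandas as pd

df = pd.DataFrame({'speaker': ['A', 'B', 'A', 'A', 'B', 'B'],
                   'message': ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']})

# same key as above: increases by one at every change of speaker
df['key'] = (df['speaker'] != df['speaker'].shift(1)).astype(int).cumsum()

merged = (df.groupby(['key', 'speaker'])['message']
            .apply(' '.join)        # one joined string per run
            .reset_index()          # turn the (key, speaker) index back into columns
            .drop(columns='key'))   # the helper key is no longer needed

print(merged)
```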

After some exploring, I have come up with a better solution than the one in my original post. I will detail it here for anyone experiencing a similar issue. I will refrain from accepting my own answer for the time being in case someone comes up with a better option.

import numpy as np

# compare each row with the previous
mydata['prev_speaker'] = mydata['speaker'].shift(1).mask(pd.isnull, mydata['speaker'])

# boolean value to determine whether current speaker differs from previous
mydata['speaker_change'] = np.where(mydata['speaker'] != mydata['prev_speaker'], 'True','False')

# empty list to record changes in speaker
counterList = []    

# initialize a counter to loop through dataframe
counter =1

# loop through dataframe, increasing counter by 1 if the speaker changes
for row in mydata['speaker_change']:
    if row == 'False':
        counterList.append(counter)
    else:
        counter+=1
        counterList.append(counter)

# add counterList to dataframe
mydata['chunking'] = counterList

# group the original message based on the chunking variable
mydata['message'] = mydata.groupby(['chunking'])['message'].transform(lambda x: ' '.join(x))

# drop duplicate rows based on message content and chunking
mydata = mydata.drop_duplicates(subset=['message','chunking'])

# drop non-needed columns
mydata = mydata.drop(['prev_speaker','speaker_change','chunking'], axis=1)

Which now gives me the following:

|---------------------|-------------------------------------|
|       Speaker       |               Message               |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
|          B          |             random text             |
|---------------------|-------------------------------------|
|          A          | random text random text random text |
|---------------------|-------------------------------------|
|          B          |             random text             |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
|          B          |       random text random text       |
|---------------------|-------------------------------------|
|          A          |             random text             |
|---------------------|-------------------------------------|
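
For what it's worth, the explicit loop over speaker_change can probably be replaced by a cumulative sum over a boolean comparison, and the transform/drop_duplicates pair by a single aggregation. Below is a minimal sketch, assuming the dataframe has only the speaker and message columns; merge_runs is a hypothetical helper name, not something provided by pandas.

```python
import pandas as pd

def merge_runs(frame):
    """Merge consecutive rows that share the same speaker (hypothetical helper)."""
    # chunk number: goes up by one whenever the speaker changes
    chunking = ((frame['speaker'] != frame['speaker'].shift(1))
                .cumsum()
                .rename('chunking'))
    return (frame.groupby(chunking)
                 .agg({'speaker': 'first', 'message': ' '.join})
                 .reset_index(drop=True))

# same speaker order as the sample data in the question
mydata = pd.DataFrame({'speaker': list('ABAAABABBA'),
                       'message': ['random text'] * 10})
result = merge_runs(mydata)
print(result)
```

Because the grouping key is passed as a separate Series, no helper columns need to be added to mydata and dropped again afterwards.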
