
Concatenate strings with pandas GroupBy based on ordering from another column

My dataframe has the following data

callerid  seq   text
1236     2      I need to talk to x
1236     6      Issue 3 is this
1236     3      This is regarding abc
1236     5      Issue 2 is this
1236     4      Issue 1 is this
1236     1      Hi
1347     2      I need to talk to x
1347     6      Issue 3 is this
1347     3      This is regarding abc
1347     5      Issue 2 is this
1347     4      Issue 1 is this
1347     1      Hi

I need to group the data by callerid, sort each group by seq, concatenate the text, and write the result to another dataframe.

The final output data should look like this

callerid        text    
1236            Hi I need to talk to x This is regarding abc Issue 1 is this Issue 2 is this Issue 3 is this
1347            Hi I need to talk to x This is regarding abc Issue 1 is this Issue 2 is this Issue 3 is this

I tried the following code

documenttext = dataextract.sort_values(['callerid','seq']).groupby('callerid')

documenttext1 = documenttext[['callerid','text']]
documenttext1 = (documenttext1.groupby('callerid')['text']
       .apply(lambda x: ' '.join(set(x.dropna())))
       .reset_index())

The first statement is not giving me the complete sorted text. This is the output I get:

callerid seq   text
1236     1     Hi
1236     2     I need to talk to x
1236     3     This is regarding abc
1347     1     Hi
1347     2     I need to talk to x
1347     3     This is regarding abc

Appreciate any help on this

Thanks in advance

As you guessed, the first step is to sort and the second is to group. groupby preserves the order of rows within each group, so you can pass ' '.join as the aggregation function to concatenate the strings in seq order.

(df.sort_values('seq')
   .groupby('callerid', sort=False)['text']
   .agg(' '.join)
   .reset_index())

   callerid                                               text
0      1236  Hi I need to talk to x This is regarding abc I...
1      1347  Hi I need to talk to x This is regarding abc I...

You shouldn't group over "seq" since you're trying to aggregate across it.
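To see the full strings rather than the truncated display, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same sort-then-group idea. The names `rows`, `df` and `result` are only for illustration. It sorts by both callerid and seq, a small variation on the snippet above that keeps the callers in a predictable order.

```python
import pandas as pd

# Sample data copied from the question; seq is deliberately out of order.
rows = [
    (2, "I need to talk to x"),
    (6, "Issue 3 is this"),
    (3, "This is regarding abc"),
    (5, "Issue 2 is this"),
    (4, "Issue 1 is this"),
    (1, "Hi"),
]
df = pd.DataFrame(
    [(cid, seq, text) for cid in (1236, 1347) for seq, text in rows],
    columns=["callerid", "seq", "text"],
)

result = (
    df.sort_values(["callerid", "seq"])        # order rows within each caller
      .groupby("callerid", sort=False)["text"]  # group only on callerid, not seq
      .agg(" ".join)                            # join texts in the sorted order
      .reset_index()
)

# to_string avoids the "..." truncation of the default display
print(result.to_string(index=False))
```

Since the result is an ordinary dataframe with columns callerid and text, it can be assigned to a new variable or written out as needed.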

Another option is to move callerid and seq into the index, sort the index, and sum the strings per callerid:

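((' ' + df.set_index(['callerid', 'seq']).sort_index(level=[0, 1])['text'])
     .groupby(level=0).sum()   # sum(level=0) was removed in pandas 2.0
     .str.strip()              # drop the leading space added before each piece
     .reset_index())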
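A note on the attempt in the question: it joined `set(x.dropna())`. A Python set has no defined order and collapses repeated values, so even correctly sorted rows can come out shuffled, and identical messages are merged. If the goal of `dropna()` was to skip missing text, one possible sketch is to drop those rows before grouping. The small `df` below is a made-up miniature of the question's data with a missing value added for illustration, not the asker's real frame.

```python
import pandas as pd

# Hypothetical miniature of the question's data; the None row is an assumption
# added here to show how missing text could be skipped.
df = pd.DataFrame({
    "callerid": [1236, 1236, 1236, 1347, 1347],
    "seq": [2, 1, 3, 2, 1],
    "text": ["I need to talk to x", "Hi", None, "Hi", "Hi"],
})

result = (
    df.dropna(subset=["text"])                  # skip rows without text
      .sort_values(["callerid", "seq"])
      .groupby("callerid", sort=False)["text"]
      .agg(" ".join)                            # keeps both order and duplicates
      .reset_index()
)

print(result.to_string(index=False))
```

Dropping the missing rows first also avoids the TypeError that `' '.join` raises when it meets a non-string value such as NaN.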
