Collapse a pandas data frame of words into sentences

My goal is to take a dataframe composed of words and tags, and collapse it into a dataframe composed of sentences and a list of tags.

Sample input:

import pandas as pd

df = pd.DataFrame([('Effect', 'O'),
               ('of', 'O'),
               ('ginseng', 'i'),
               ('extract', 'i'),
               ('supplementation', 'i'),
               ('on', 'O'),
               ('testicular', 'o'),
               ('functions', 'o'),
               ('in', 'O'),
               ('diabetic', 'p'),
               ('rats', 'p'),
               ('.', 'p'),
               ('OBJECTIVE', 'O'),
               ('It', 'O'),
               ('was', 'O')],
               columns=('token', 'annotation'))

Goal output:

df = pd.DataFrame([('Effect of ginseng extract supplementation on testicular functions in diabetic rats.',
                    ['O','O','i','i','i','O','o','o','O','p','p','p']),
                   ('OBJECTIVE It was', ['O','O','O'])],
                  columns=('token', 'annotation'))

Sorry for the goofy example - that really is the first 15 rows of this dataset!!

Any ideas on how to collapse the rows of words into rows of sentences would be much appreciated.

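Use GroupBy.agg, grouping on a counter that increases by one on the row after each '.':

# the key is a running sentence number: it increments right after every '.'
new_df = (df.groupby(df['token'].eq('.').shift(fill_value=False).cumsum(),
                     as_index=False)
            .agg({'token': ' '.join, 'annotation': list}))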
print(new_df)
                                               token  \
0  Effect of ginseng extract supplementation on t...   
1                                   OBJECTIVE It was   

                             annotation  
0  [O, O, i, i, i, O, o, o, O, p, p, p]  
1                             [O, O, O]
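
To see why that grouping key works, it can help to build it one step at a time. The following is a minimal sketch on a shortened four-row frame; the names `is_period`, `starts_new` and `sentence_id` are only illustrative and do not appear in the answer above.

```python
import pandas as pd

# Shortened sample: the end of one sentence and the start of the next.
df = pd.DataFrame(
    [("rats", "p"), (".", "p"), ("OBJECTIVE", "O"), ("It", "O")],
    columns=("token", "annotation"),
)

# True on the rows whose token is a period.
is_period = df["token"].eq(".")

# Shift down by one row so the flag lands on the row *after* each period;
# fill_value=False keeps the first row from becoming NaN.
starts_new = is_period.shift(fill_value=False)

# The running total of those flags is a sentence number for every row.
sentence_id = starts_new.cumsum()

print(pd.DataFrame({
    "token": df["token"],
    "is_period": is_period,
    "starts_new": starts_new,
    "sentence_id": sentence_id,
}))
```

Because the flag is shifted before the cumulative sum, the period itself stays in the sentence it closes rather than opening the next one.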

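If you don't want to include the final period, drop the '.' rows (and their annotations) from the grouping key:

m = df['token'].eq('.')
# rows missing from the key are left out of every group
new_df = (df.groupby(m.shift(fill_value=False).cumsum().loc[~m], as_index=False)
            .agg({'token': ' '.join, 'annotation': list}))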
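
Neither variant reproduces the goal output exactly, where the period is attached to the last word ("rats.") and its tag is kept. Below is a minimal sketch of one way to get there, assuming every sentence-ending period is its own token: join with spaces as before, then remove the space in front of the period. The `sentence_id` name and the shortened frame are illustrative only, and the key is renamed so that it cannot clash with the `token` column.

```python
import pandas as pd

# Shortened sample with one complete sentence and the start of another.
df = pd.DataFrame(
    [("diabetic", "p"), ("rats", "p"), (".", "p"),
     ("OBJECTIVE", "O"), ("It", "O"), ("was", "O")],
    columns=("token", "annotation"),
)

# Running sentence number, incremented on the row after each period.
sentence_id = (
    df["token"].eq(".").shift(fill_value=False).cumsum().rename("sentence_id")
)

new_df = (
    df.groupby(sentence_id)
      .agg({"token": " ".join, "annotation": list})
      .reset_index(drop=True)
)

# Turn "rats ." into "rats." with a literal (non-regex) replacement.
new_df["token"] = new_df["token"].str.replace(" .", ".", regex=False)

print(new_df)
```

The annotation lists are untouched by the string cleanup, so each sentence still carries one tag per original token, including the tag of the period.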
