
Split data frame of comments into multiple rows

I have a data frame with long comments, and I want to split them into individual sentences using spaCy's sentencizer.

import pandas as pd

Comments = pd.read_excel('Comments.xlsx', sheet_name='Sheet1')
Comments
>>>
         reviews
    0    One of the rare films where every discussion leaving the theater is about how much you 
         just had, instead of an analysis of its quotients.
    1    Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving, 
         and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that 
         re-watchability factor.

I loaded the model like this:

import spacy
nlp = spacy.load("en_core_web_sm")

Then I used the sentencizer:

from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x : list( nlp(x).sents))

But when I check the result, all the sentences of a comment are still in a single row, like this:

[One of the rare films where every discussion leaving the theater is about how much you just had.,
 Instead of an analysis of its quotients.]

Thanks a lot for any help. I'm new to using NLP tools with DataFrames.

Currently, Data is a Series whose rows are lists of sentences or, more precisely, lists of spaCy Span objects. You probably want to get the text of each sentence and put each sentence on its own row.
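To see the difference between a Span and its text, here is a minimal sketch. It assumes spaCy v3 and uses spacy.blank("en") with only the sentencizer added, which should behave like the English() pipeline from the question; the sample string is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline with only the rule-based sentencizer,
# comparable to English() + add_pipe('sentencizer') in the question.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. And this is the second.")

spans = list(doc.sents)                 # spaCy Span objects
texts = [sent.text for sent in spans]   # plain Python strings

print(type(spans[0]).__name__)  # Span
print(texts)                    # ['This is the first sentence.', 'And this is the second.']
```

The list of Span objects prints like a list of sentences, which is why the rows in Data look like text even though they are not strings.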

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]

comments = pd.DataFrame(comments) # building your input DataFrame
+----+--------------------------------------------------------------------------+
|    | reviews                                                                  |
|----+--------------------------------------------------------------------------|
|  0 | This is the first sentence of the first review. And this is the second.  |
|  1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+

Now let's define a function which, given a string, returns the list of its sentences as texts (strings).

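def obtain_sentences(s):
    """Return the sentences of the string s as a list of plain strings."""
    doc = nlp(s)  # nlp is the sentencizer pipeline defined in the question
    return [sent.text for sent in doc.sents]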

The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.

data = comments.copy()
data['reviews'] = comments['reviews'].apply(obtain_sentences)
data = data.explode('reviews').reset_index(drop=True)
data

I used explode to turn each element of the lists of sentences into its own row, and reset_index(drop=True) to renumber the resulting rows.
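For readers who have not used explode before, the following sketch isolates that step with pandas only. The toy DataFrame is hypothetical and already contains lists of strings, so no spaCy pipeline is needed to run it.

```python
import pandas as pd

# Hypothetical toy data: each row already holds a list of sentence strings.
toy = pd.DataFrame({
    'reviews': [['First sentence.', 'Second sentence.'], ['Only sentence.']]
})

# explode gives every list element its own row; the original index label
# is repeated for rows that came from the same list.
exploded = toy.explode('reviews')
print(exploded.index.tolist())  # [0, 0, 1]

# reset_index(drop=True) replaces the repeated labels with a fresh 0..n-1 index.
flat = exploded.reset_index(drop=True)
print(flat)
```

Keeping the repeated index (that is, skipping reset_index) can be useful if you later need to know which review each sentence came from.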

This is the resulting output:

+----+--------------------------------------------------+
|    | reviews                                          |
|----+--------------------------------------------------|
|  0 | This is the first sentence of the first review.  |
|  1 | And this is the second.                          |
|  2 | This is the first sentence of the second review. |
|  3 | And this is the second.                          |
+----+--------------------------------------------------+
