Split a DataFrame of comments into multiple rows

Question

I have a DataFrame with long comments and I want to split them into individual sentences using the spaCy sentencizer.

import pandas as pd

Comments = pd.read_excel('Comments.xlsx', sheet_name='Sheet1')
Comments
>>>
         reviews
    0    One of the rare films where every discussion leaving the theater is about how much you 
         just had, instead of an analysis of its quotients.
    1    Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving, 
         and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that 
         re-watchability factor.

I loaded the model like this:

import spacy
nlp = spacy.load("en_core_web_sm")

Then I applied the sentencizer:

from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x : list( nlp(x).sents))

But when I check the result, the sentences of each review are still in a single row, like this:

[One of the rare films where every discussion leaving the theater is about how much you just had.,
 Instead of an analysis of its quotients.]

Thanks a lot for any help. I'm new to using NLP tools with DataFrames.

Answer 1

Currently, Data is a Series whose rows are lists of sentences or, more precisely, lists of spaCy Span objects. You probably want to obtain the text of these sentences and put each sentence on its own row.
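To make the difference between Span objects and plain strings concrete, here is a minimal sketch. It assumes spaCy v3 and uses spacy.blank("en") with the rule-based sentencizer, so no trained model is needed; the sample string is made up for illustration.

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline plus the rule-based sentencizer (no model download).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Hypothetical sample text, not taken from the question's data.
sents = list(nlp("One sentence. Another one.").sents)

# Each element is a Span that still points into the parent Doc.
print(all(isinstance(s, Span) for s in sents))

# .text gives the plain string for each sentence.
texts = [s.text for s in sents]
print(texts)
```

Storing strings rather than Span objects in the DataFrame is usually more convenient, since the result no longer depends on the Doc objects that produced it.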

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]

comments = pd.DataFrame(comments) # building your input DataFrame

+----+--------------------------------------------------------------------------+
|    | reviews                                                                  |
|----+--------------------------------------------------------------------------|
|  0 | This is the first sentence of the first review. And this is the second.  |
|  1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+

Now let's define a function which, given a string, returns its sentences as a list of strings.

def obtain_sentences(s):
    doc = nlp(s)
    sents = [sent.text for sent in doc.sents]
    return sents

The function can be applied to the comments DataFrame to produce a new DataFrame containing one sentence per row.

data = comments.copy()
data['reviews'] = comments.apply(lambda x: obtain_sentences(x['reviews']), axis=1)
data = data.explode('reviews').reset_index(drop=True)
data

I used explode to turn each element of the sentence lists into its own row.
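As a small illustration of what explode and reset_index do here, the sketch below uses pandas only. The review_id column and the sentence strings are hypothetical and serve just to show that other columns are repeated alongside each sentence.

```python
import pandas as pd

# Hypothetical toy data: each row already holds a list of sentence strings.
df = pd.DataFrame({
    "review_id": [10, 11],
    "reviews": [["First sentence.", "Second sentence."], ["Only sentence."]],
})

# explode creates one row per list element and repeats the original
# index label (and every other column) for each element.
exploded = df.explode("reviews")
print(exploded.index.tolist())

# reset_index(drop=True) replaces the repeated labels with 0..n-1.
flat = exploded.reset_index(drop=True)
print(flat)
```

Keeping an identifier column such as review_id makes it possible to trace each sentence back to its review after the index has been reset.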

And this is the resulting output:

+----+--------------------------------------------------+
|    | reviews                                          |
|----+--------------------------------------------------|
|  0 | This is the first sentence of the first review.  |
|  1 | And this is the second.                          |
|  2 | This is the first sentence of the second review. |
|  3 | And this is the second.                          |
+----+--------------------------------------------------+
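For reference, the whole approach can be put together in one self-contained script. This is a sketch under a few assumptions: spaCy v3, a blank English pipeline with only the sentencizer, and nlp.pipe instead of a row-by-row apply (a swap for batching, not part of the answer above).

```python
import pandas as pd
import spacy

# Blank English pipeline with the rule-based sentencizer only.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

comments = pd.DataFrame({
    "reviews": [
        "This is the first sentence of the first review. And this is the second.",
        "This is the first sentence of the second review. And this is the second.",
    ]
})

# nlp.pipe processes the texts as a stream, which tends to be faster
# than calling nlp() once per row on larger inputs.
docs = nlp.pipe(comments["reviews"].tolist())

data = comments.copy()
data["reviews"] = [[sent.text for sent in doc.sents] for doc in docs]

# One sentence per row, with a fresh 0..n-1 index.
data = data.explode("reviews").reset_index(drop=True)
print(data)
```

The sentencizer splits on punctuation alone, so abbreviations or unusual punctuation may produce different boundaries than a trained pipeline would.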
