Splitting a DataFrame of reviews into multiple rows

Question

I have a DataFrame of long reviews, and I want to use the spaCy sentencizer to split them into individual sentences.

import pandas as pd

Comments = pd.read_excel('Comments.xlsx', sheet_name='Sheet1')
Comments
>>>
         reviews
    0    One of the rare films where every discussion leaving the theater is about how much you 
         just had, instead of an analysis of its quotients.
    1    Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving, 
         and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that 
         re-watchability factor.

I loaded the model like this:

import spacy
nlp = spacy.load("en_core_web_sm")  # the small English model is en_core_web_sm; there is no English "news" model

and used the sentencizer:

from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x : list( nlp(x).sents))

But when I check the result, the sentences of each review still sit together in a single row, like this:

[One of the rare films where every discussion leaving the theater is about how much you just had.,
 Instead of an analysis of its quotients.]

Thank you very much for your help. I am new to using NLP tools on DataFrames.

Answer 1

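Currently, Data is a Series whose rows are lists of sentences, or more precisely, lists of spaCy Span objects. You probably want to take the text of those sentences and put each sentence on its own row.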
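To make the difference between Span objects and their text concrete, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline (spacy.blank("en")), which is my choice for illustration so that no model download is needed; the sample sentence is made up.

```python
import spacy

# Assumption: spaCy v3. A blank English pipeline only tokenizes;
# the sentencizer adds rule-based sentence boundaries on top of it.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. And this is the second.")

spans = list(doc.sents)                # spaCy Span objects, tied to the Doc
texts = [sent.text for sent in spans]  # plain Python strings

print(type(spans[0]).__name__)  # Span
print(texts)
```

Storing plain strings rather than Span objects keeps the DataFrame independent of the spaCy Doc and makes it easy to save or compare the sentences later.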

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]

comments = pd.DataFrame(comments) # building your input DataFrame

+----+--------------------------------------------------------------------------+
|    | reviews                                                                  |
|----+--------------------------------------------------------------------------|
|  0 | This is the first sentence of the first review. And this is the second.  |
|  1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+

Now let's define a function that, given a string, returns its sentences as a list of text strings.

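def obtain_sentences(s):
    # Run the pipeline (nlp with the sentencizer, as set up in the question)
    doc = nlp(s)
    # Keep only the text of each sentence Span
    sents = [sent.text for sent in doc.sents]
    return sents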

This function can be applied to the comments DataFrame to produce a new DataFrame containing the sentences.

data = comments.copy()
# Replace each review with its list of sentence strings
data['reviews'] = comments['reviews'].apply(obtain_sentences)
# One row per sentence, with a fresh 0..n-1 index
data = data.explode('reviews').reset_index(drop=True)
data

I used explode to turn the elements of each sentence list into separate rows.

This is the resulting output:

+----+--------------------------------------------------+
|    | reviews                                          |
|----+--------------------------------------------------|
|  0 | This is the first sentence of the first review.  |
|  1 | And this is the second.                          |
|  2 | This is the first sentence of the second review. |
|  3 | And this is the second.                          |
+----+--------------------------------------------------+
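Note that reset_index(drop=True) discards the information about which review each sentence came from. If that link matters, one option is to keep the original index as a column. The sketch below uses only pandas; the toy data and the column name review_id are hypothetical, chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: each row already holds a list of sentence strings,
# as it would after applying a sentence-splitting function.
data = pd.DataFrame({
    "reviews": [
        ["First sentence of review one.", "Second sentence of review one."],
        ["Only sentence of review two."],
    ]
})

# explode() emits one row per list element and repeats the original index label.
exploded = data.explode("reviews")
print(exploded.index.tolist())  # [0, 0, 1]

# Name the index and turn it into a regular column instead of dropping it.
# "review_id" is an illustrative name, not something from the original post.
tidy = exploded.rename_axis("review_id").reset_index()
print(tidy)
```

With review_id in place, the sentences can later be grouped or joined back to the original reviews.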
