Stripping proper nouns from text

I have a df with several thousand rows of text data. I'm using spaCy to do some NLP on a single column of that df, and I'm trying to remove proper nouns, stopwords, and punctuation from my text data with the following:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['TIP_all_txt'].astype('unicode').values, batch_size=9845,
                        n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
        lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
        pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['s_tokens_all_txt'] = tokens
df['s_lemmas_all_txt'] = lemma
df['s_pos_all_txt'] = pos

df.head()

But I get this error and I can't work out why:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-73578fd46847> in <module>()
      6                         n_threads=3):
      7     if doc.is_parsed:
----> 8         tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
      9         lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
     10         pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])

<ipython-input-34-73578fd46847> in <listcomp>(.0)
      6                         n_threads=3):
      7     if doc.is_parsed:
----> 8         tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
      9         lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
     10         pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])

AttributeError: 'spacy.tokens.token.Token' object has no attribute 'is_propn'

If I take out not n.is_propn the code runs as expected. I've googled and read through the spaCy documentation, but so far I haven't found an answer.

I don't see an is_propn attribute available on the Token object.

I think you should instead check whether the part-of-speech type is PROPN (reference):

from spacy.parts_of_speech import PROPN

def is_proper_noun(token):
    if token.doc.is_tagged is False:  # check if the document was POS-tagged
        raise ValueError('token is not POS-tagged')

    return token.pos == PROPN
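
For instance, here is a minimal sketch of wiring that check into the original filter via the string attribute pos_; the en_core_web_sm model name and the sample sentence are only assumptions for illustration:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model; any pipeline with a tagger works
doc = nlp(u"Alice moved to Paris and bought a small flat.")

# Drop punctuation, stopwords, whitespace and proper nouns.
kept = [tok.text for tok in doc
        if not (tok.is_punct or tok.is_stop or tok.is_space or tok.pos_ == 'PROPN')]
print(kept)  # 'Alice' and 'Paris' should be filtered out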

Adding on to @alecxe's answer.

There is no need to:

  • populate all the dataframe rows in one go.
  • keep separate token, lemma and pos lists while populating the dataframe.

You can try:

# Output frame; the source text still comes from the original df.
out_df = pd.DataFrame(columns=['tokens', 'lemmas', 'pos'])

annotated_docs = nlp.pipe(df['TIP_all_txt'].astype('unicode').values,
                          batch_size=9845, n_threads=3)

for doc in annotated_docs:
    if doc.is_parsed:
        # Remove the tokens that you don't want.
        tokens, lemmas, pos = zip(*[(tok.text, tok.lemma_, tok.pos_)
                                    for tok in doc
                                    if not (tok.is_punct or tok.is_stop
                                            or tok.is_space or is_proper_noun(tok))])
        # Populate the DataFrame; append returns a new frame, so reassign.
        out_df = out_df.append({'tokens': tokens, 'lemmas': lemmas, 'pos': pos},
                               ignore_index=True)

And here's a neater pandas trick, taken from How to split a column of tuples in a pandas dataframe?, although the dataframe will take up more memory:

# A separate output frame again; the text still comes from the original df.
out_df = pd.DataFrame(columns=['Tokens'])

annotated_docs = nlp.pipe(df['TIP_all_txt'].astype('unicode').values,
                          batch_size=9845, n_threads=3)

for doc in annotated_docs:
    if doc.is_parsed:
        # Remove the tokens that you don't want and store one
        # (tokens, lemmas, pos) tuple per document.
        out_df = out_df.append(
            {'Tokens': tuple(zip(*[(tok.text, tok.lemma_, tok.pos_)
                                   for tok in doc
                                   if not (tok.is_punct or tok.is_stop
                                           or tok.is_space or is_proper_noun(tok))]))},
            ignore_index=True)

# Split each 3-tuple into its own column.
out_df[['tokens', 'lemmas', 'pos']] = out_df['Tokens'].apply(pd.Series)
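
As a toy illustration of that tuple-splitting trick, independent of the spaCy pipeline (the data below is made up):

import pandas as pd

# Each row holds one (tokens, lemmas, pos) tuple.
toy = pd.DataFrame({'Tokens': [(['dogs', 'bark'], ['dog', 'bark'], ['NOUN', 'VERB'])]})
toy[['tokens', 'lemmas', 'pos']] = toy['Tokens'].apply(pd.Series)
print(toy[['tokens', 'lemmas', 'pos']])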

from nltk.tag import pos_tag
import pandas as pd

def proper_nouns(speech):
    # POS-tag the whitespace-split words and keep the proper nouns (NNP).
    tagged_sent = pos_tag(speech.split())
    pn = [word for word, pos in tagged_sent if pos == 'NNP']
    pn = [x.lower() for x in pn]
    prn = list(set(pn))
    prn = pd.DataFrame({'b_words': prn, 'bucket_name': 'proper noun'})
    return prn

df = proper_nouns(speech)

Here speech will be your text!
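
A quick usage sketch, assuming the NLTK tagger data has been downloaded; the sample sentence is invented:

import nltk
nltk.download('averaged_perceptron_tagger')  # one-off download; resource name may differ across NLTK versions

speech = "Barack Obama gave a speech in Berlin"  # made-up example text
print(proper_nouns(speech))  # expect rows such as 'barack', 'obama' and 'berlin' bucketed as 'proper noun'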
