
How to remove stop words and get lemmas in a pandas data frame using spacy?

I have a column of tokens in a pandas data frame in Python. Something that looks like:

 word_tokens
 (the,cheeseburger,was,great)
 (i,never,did,like,the,pizza,too,much)
 (yellow,submarine,was,only,an,ok,song)

I want to get two more new columns in this dataframe using the spacy library: one column that contains each row's tokens with the stopwords removed, and another containing the lemmas of those remaining tokens. How could I do that?

You're right about making your text a spaCy type - you want to transform every tuple of tokens into a spaCy Doc. From there, it is best to use the attributes of the tokens to answer the questions "is the token a stop word" (use token.is_stop) and "what is the lemma of this token" (use token.lemma_). My implementation is below; I altered your input data slightly to include some examples of plurals so you can see that the lemmatization works properly.

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')

texts = [('the','cheeseburger','was','great'),
         ('i','never','did','like','the','pizzas','too','much'), 
         ('yellowed','submarines','was','only','an','ok','song')]

df = pd.DataFrame({'word_tokens': texts})

The initial DataFrame looks like this:

    word_tokens
 0  ('the', 'cheeseburger', 'was', 'great')
 1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')
 2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')

I define functions to perform the main tasks:

  1. tuple of tokens -> spaCy Doc
  2. spaCy Doc -> list of non-stop words
  3. spaCy Doc -> list of non-stop, lemmatized words

def to_doc(words:tuple) -> spacy.tokens.Doc:
    # Create SpaCy documents by joining the words into a string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]

Applying these looks like:

# create documents for all tuples of tokens
docs = list(map(to_doc, df.word_tokens))

# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops, docs))

# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize, docs))

The output you get should look like this:

    word_tokens                                                     removed_stops                             lemmatized
 0  ('the', 'cheeseburger', 'was', 'great')                         ['cheeseburger', 'great']                 ['cheeseburger', 'great']
 1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')   ['like', 'pizzas']                        ['like', 'pizza']
 2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')   ['yellowed', 'submarines', 'ok', 'song']  ['yellow', 'submarine', 'ok', 'song']

Based on your use case, you may want to explore other attributes of spaCy's document object ( https://spacy.io/api/doc ). In particular, take a look at doc.noun_chunks and doc.ents if you're trying to extract more meaning out of the text.
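As a rough illustration (the sentence below is made up, and the exact spans and labels you get depend on the model):

doc = nlp("the yellow submarine was only an ok song by The Beatles")

# noun_chunks yields base noun phrases, e.g. "the yellow submarine", "an ok song"
print([chunk.text for chunk in doc.noun_chunks])

# ents yields named entities with labels; a typical model may tag "The Beatles" here
print([(ent.text, ent.label_) for ent in doc.ents])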

It is also worth noting that if you plan on using this with a very large number of texts, you should consider nlp.pipe : https://spacy.io/usage/processing-pipelines . It processes your documents in batches instead of one by one, and could make your implementation more efficient.
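A rough sketch of the same pipeline using nlp.pipe (not part of the original answer; it reuses the df, remove_stops, and lemmatize defined above):

# Join each tuple into a string and let nlp.pipe process the texts in batches
docs = list(nlp.pipe(' '.join(words) for words in df.word_tokens))

df['removed_stops'] = [remove_stops(doc) for doc in docs]
df['lemmatized'] = [lemmatize(doc) for doc in docs]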

If you are working with spacy you should make your text a spacy type, so something like this:

 nlp = spacy.load("en_core_web_sm")
 text = topic_data['word_tokens'].values.tolist()
 text = '.'.join(map(str, text))
 text = nlp(text)

This makes it easier to work with. You can then tokenize the words like this:

 token_list = []
 for token in text:
     token_list.append(token.text)

And remove stop words like so:

 token_list = [word for word in token_list if word not in nlp.Defaults.stop_words]

I haven't figured out the lemmatization part yet, but this is a start until then.
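A possible sketch for that part, borrowing token.lemma_ and token.is_stop from the first answer and applying them to the text Doc built above:

 # Lemmas of the tokens that are not stop words
 lemma_list = [token.lemma_ for token in text if not token.is_stop]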
