How to remove English stop words from a dataframe column using a custom stop words dictionary

Question

I'm writing a function that takes a dataframe (df) of tweets as input. I need to tokenize the tweets, remove the stop words, and add the result as a new column. I can't import anything except numpy and pandas.

The stop words are in a dictionary as follows:

stop_words_dict = {
'stopwords':[
    'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon', 
    'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former', 
    'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through', 
    'seeming', 'hence', 'us', 'anywhere....}

This is what I attempted, a function to remove the stop words:

def stop_words_remover(df):
    stop_words = list(stop_words_dict.values())
    df["Without Stop Words"] = df["Tweets"].str.lower().str.split()
    df["Without Stop Words"] = df["Without Stop Words"].apply(lambda x: [word for word in x if word not in stop_words])
    return df

So if this was my input:

 [@bongadlulane, please, send, an, email, to,]

This is the expected output:

[@bongadlulane, send, email, mediadesk@eskom.c]

But I keep getting the former back instead of the latter.

Any insight would be really appreciated. Thank you.

Answer 1

Your problem is in this line:

stop_words = list(stop_words_dict.values())

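This returns a list whose only element is the list of stop words. Each token is then compared against that inner list rather than against the individual stop words, so the "not in" test is always true and nothing is removed.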
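A quick check makes the nesting visible. The dictionary below is a shortened, hypothetical stand-in for the one in the question, and the names nested and flat are only for illustration:

```python
# Shortened stand-in for the question's dictionary (the real one is much longer).
stop_words_dict = {'stopwords': ['please', 'an', 'to']}

# list(dict.values()) wraps each value in an outer list,
# so the stop words end up one level too deep.
nested = list(stop_words_dict.values())
print(nested)               # [['please', 'an', 'to']]
print(len(nested))          # 1
print('please' in nested)   # False: 'please' is compared with the inner list

# Indexing by key returns the inner list directly.
flat = stop_words_dict['stopwords']
print('please' in flat)     # True
```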

Replace it with:

stop_words = stop_words_dict['stopwords']
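With that change, the whole function might look like the sketch below. The dictionary is again a shortened, hypothetical stand-in, the sample tweet is made up from the tokens shown in the question, and converting the list to a set is an optional tweak for faster lookups rather than part of the fix:

```python
import pandas as pd

# Shortened, hypothetical stand-in for the question's stop_words_dict.
stop_words_dict = {'stopwords': ['please', 'an', 'to', 'where', 'if']}

def stop_words_remover(df):
    # Take the list stored under the 'stopwords' key; the set is optional
    # and only speeds up the membership test.
    stop_words = set(stop_words_dict['stopwords'])
    # Lowercase and split each tweet on whitespace, then drop stop words.
    df["Without Stop Words"] = (
        df["Tweets"].str.lower().str.split()
        .apply(lambda words: [w for w in words if w not in stop_words])
    )
    return df

# Made-up sample tweet for illustration.
df = pd.DataFrame({"Tweets": ["@bongadlulane Please send an email to mediadesk@eskom.c"]})
result = stop_words_remover(df)
print(result["Without Stop Words"].iloc[0])
# ['@bongadlulane', 'send', 'email', 'mediadesk@eskom.c']
```

Like the original, this version adds the column to the dataframe that was passed in and returns that same dataframe.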

Answer 1
1 Accepted 2020-04-01 12:46:25