
Removing stopwords from file

I want to remove stopwords from the Data column in my file. I filtered the rows down to the lines where the end-user is speaking, but the stopwords are not filtered out by usertext.apply(lambda x: [word for word in x if word not in stop_words]). What am I doing wrong?

import pandas as pd
from stop_words import get_stop_words
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1")
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']]
stop_words = get_stop_words('dutch')
clean = usertext.apply(lambda x: [word for word in x if word not in stop_words])
print(clean)

You can build a regex pattern from your stop words and call the vectorised str.replace to remove them:

In [124]:
stop_words = ['a','not','the']
stop_words_pat = '|'.join(['\\b' + stop +  '\\b' for stop in stop_words])
stop_words_pat

Out[124]:
'\\ba\\b|\\bnot\\b|\\bthe\\b'

In [125]:    
df = pd.DataFrame({'text':['a to the b', 'the knot ace a']})
df['text'].str.replace(stop_words_pat, '')

Out[125]:
0         to  b
1     knot ace 
Name: text, dtype: object

Here we use a list comprehension to build a pattern, surrounding each stop word with '\\b' (a word boundary, so 'not' cannot match inside 'knot'), and then we OR all the words together with '|'.
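A runnable sketch of the same idea (the stop words here are placeholders; re.escape and regex=True are added defensively, since current pandas versions require regex=True for str.replace to treat the pattern as a regex):

```python
import re
import pandas as pd

stop_words = ['a', 'not', 'the']

# \b is a word boundary, so 'not' will not match inside 'knot';
# re.escape guards against stop words containing regex metacharacters
pat = '|'.join(r'\b' + re.escape(w) + r'\b' for w in stop_words)

df = pd.DataFrame({'text': ['a to the b', 'the knot ace a']})
cleaned = df['text'].str.replace(pat, '', regex=True)
print(cleaned.tolist())  # leftover spaces remain where words were removed
```

Removing only the words leaves the surrounding spaces behind, as in the output above; a follow-up str.replace of runs of whitespace (plus str.strip) would tidy that up if needed.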

Two issues:

First, you have a module called stop_words and you later create a variable named stop_words. This is bad form.

Second, you are passing .apply a lambda function that expects its x parameter to be a list, rather than a value within a list.

That is, instead of doing df.apply(sqrt), you are doing df.apply(lambda x: [sqrt(val) for val in x]).
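To make the distinction concrete, here is a small illustrative sketch (the data is made up, not from the question): DataFrame.apply hands the function a whole column at a time, while Series.apply hands it one scalar at a time.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 4, 9]})

# DataFrame.apply: the function receives a whole column (a Series) at a time
col_sum = df.apply(lambda col: col.sum())

# Series.apply: the function receives one scalar value at a time
roots = df['x'].apply(lambda v: v ** 0.5)
```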

You should either do the list processing yourself:

clean = [x for x in usertext if x not in stop_words]

Or you should do the apply with a function that takes one word at a time:

clean = usertext.apply(lambda x: x if x not in stop_words else '')
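Note that this per-cell approach only helps if each cell of Data holds a single word. If, as is likely for chat data, each cell holds a whole message, one possible sketch (assuming whitespace tokenisation, and with a few made-up Dutch stop words standing in for the real get_stop_words('dutch') list) is:

```python
import pandas as pd

stop_words = {'de', 'het', 'een'}  # placeholder stop words, not the real list

usertext = pd.Series(['ik heb een vraag', 'het werkt niet'])

# Split each message into words, drop the stop words, and re-join:
clean = usertext.apply(
    lambda text: ' '.join(w for w in text.split() if w not in stop_words)
)
```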

As @Jean-François Fabre suggested in a comment, you can speed things up if your stop_words is a set rather than a list:

from stop_words import get_stop_words

nl_stop_words = set(get_stop_words('dutch'))    # NOTE: set

usertext = ...
clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
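A self-contained illustration of why the set matters (synthetic words, and absolute timings will vary by machine): membership tests on a list scan it element by element, while a set hashes the key.

```python
import timeit

words = ['w%d' % i for i in range(10_000)]
stop_list = ['s%d' % i for i in range(1_000)]
stop_set = set(stop_list)   # set membership tests are O(1) on average

# Filtering against the list scans all 1,000 entries per word;
# filtering against the set does a single hash lookup per word
t_list = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=1)
t_set = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=1)
print(t_list, t_set)
```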
