
Removing Stopwords from Column of Strings in Python

I'm working on a project that reads text and predicts an outcome. As part of cleaning the data, I am trying to remove all of the stopwords. I need the output to be in DataFrame format, but that is where I am running into issues.

So, after much cleaning, I got the data to the point where it looks like this:

(screenshot of the dataframe, not reproduced here)

The labels are in a different dataframe that I will have to merge, but that is beside the point.

What I am trying to do now is remove all of the stopwords from the string in each row.

After some research, the code I am using looks like this:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ht_comments_only_no_stop['All_Comments'] = ht_comments_only_summary['All_Comments'].apply(lambda x: [item for item in x if item not in stop_words])

ht_comments_only_summary is basically what you see in the first picture above.

The problem is that when I now look at ht_comments_only_no_stop, I see:

(screenshot of the output, not reproduced here)

What I need is output that looks just like the first picture, in dataframe format, minus all the stopwords in the "All_Comments" column.

Any help would be greatly appreciated.

OK, I figured it out.

First, there was a separate issue: I needed to break each string down into a list of words.
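A minimal sketch of why the split matters, assuming each cell of All_Comments holds one plain string. The sample comment and the tiny stop_words set below are made up for illustration; the set is only a stand-in for NLTK's English list so the snippet runs without a download.

```python
# Hand-picked stand-in for set(stopwords.words('english')); an assumption
# made so this runs offline.
stop_words = {"the", "is", "a", "of", "and", "to", "in", "i"}

# Hypothetical comment text, not taken from the original data.
comment = "the service is great"

# Iterating over a string yields single characters, so the filter only
# drops one-letter stopwords such as "a" and "i".
by_character = [item for item in comment if item not in stop_words]

# Splitting on whitespace first yields whole words, which is what the
# stopword filter is meant to compare against.
by_word = [item for item in comment.split() if item not in stop_words]

print(by_character[:5])
print(by_word)
```

This is one plausible reading of the behaviour in the question: the list comprehension there loops over `x` directly, and when `x` is a string that loop walks through it character by character.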

After that, I could successfully remove the stopwords.

Finally, I was able to convert the output back into a dataframe.
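The post does not show the final code, so the following is only a sketch of how the three steps could fit together, not the author's exact solution. The sample rows, the small stop_words set, and the helper name remove_stopwords are all assumptions introduced for illustration; with NLTK installed, stop_words would be built from stopwords.words('english') as in the question.

```python
import pandas as pd

# Hypothetical sample data; the real ht_comments_only_summary is not shown.
ht_comments_only_summary = pd.DataFrame(
    {"All_Comments": ["the food was great", "i did not like the service"]}
)

# Stand-in stopword set so the sketch runs offline (an assumption).
stop_words = {"the", "was", "i", "did", "not"}


def remove_stopwords(text, stop_words):
    """Split text into words, drop stopwords, and join the rest back up."""
    words = text.split()
    # Compare in lowercase, since the stopword set here is all lowercase.
    kept = [word for word in words if word.lower() not in stop_words]
    return " ".join(kept)


# Copy first so the result is a DataFrame with the same columns and index.
ht_comments_only_no_stop = ht_comments_only_summary.copy()
ht_comments_only_no_stop["All_Comments"] = ht_comments_only_summary[
    "All_Comments"
].apply(lambda text: remove_stopwords(text, stop_words))

print(ht_comments_only_no_stop)
```

Joining the kept words back into a single string keeps each cell in the same shape as the input column. If a list of tokens per row is preferred for later modelling, the `" ".join(...)` step can be dropped.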

Best
