
Removing stopwords from tweets Python

I am trying to remove stopwords from tweets that I have imported from Twitter. After removing the stopwords, the list of strings will be placed in a new column in the same row. I can easily accomplish this one row at a time; however, attempting to loop the method over the whole DataFrame does not seem to succeed.

How would I do this?

Snippet of my data:

tweets['text'][0:5]
Out[21]: 
0    Why #litecoin will go over 50 USD soon ? So ma...
1    get 20 free #bitcoin spins at...
2    Are you Bullish or Bearish on #BMW? Start #Tra...
3    Are you Bullish or Bearish on the S&P 500?...
4    TIL that there is a DAO ExtraBalance Refund. M...

The following works in a single-row scenario:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires nltk.download('stopwords') and nltk.download('punkt') to have been run
stop_words = set(stopwords.words('english'))
tweets['text-filtered'] = ""

word_tokens = word_tokenize(tweets['text'][1])
filtered_sentence = [w for w in word_tokens if w not in stop_words]
tweets['text-filtered'][1] = filtered_sentence

tweets['text-filtered'][1]
Out[22]: 
['get',
 '20',
 'free',
 '#',
 'bitcoin',
 'spins',
 'withdraw',
 'free',
 '#',
 'btc',
 '#',
 'freespins',
 '#',
 'nodeposit',
 '#',
 'casino',
 '#',
 '...',
 ':']

My attempt at a loop does not succeed:

for i in tweets:
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

A snippet of the traceback:

Traceback (most recent call last):

  File "<ipython-input-23-6d7dace7a2d0>", line 2, in <module>
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))

...

KeyError: 'id'

Based on @Prune's reply, I have managed to correct my mistakes. Here is a potential solution:

count = 0
for i in tweets['text']:
    word_tokens = word_tokenize(i)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    tweets['text-filtered'][count] = filtered_sentence
    count += 1
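
A more idiomatic alternative (a sketch, assuming tweets is a pandas DataFrame as above) skips the manual counter and lets pandas apply the tokenize-and-filter step to each row of the 'text' column:

# build the filtered column in one pass; each cell becomes a list of kept tokens
tweets['text-filtered'] = tweets['text'].apply(
    lambda t: [w for w in word_tokenize(t) if w not in stop_words]
)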

My previous attempt was looping through the columns of the DataFrame, tweets. The first column in tweets was "id".

tweets.columns
Out[30]: 
Index(['id', 'user_bg_color', 'created', 'geo', 'user_created', 'text',
       'polarity', 'user_followers', 'user_location', 'retweet_count',
       'id_str', 'user_name', 'subjectivity', 'coordinates',
       'user_description', 'text-filtered'],
      dtype='object')

You're confused about list indexing:

for i in tweets:
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False))
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

Note that tweets behaves like a dictionary here; tweets['text'] is a column of strings. Thus, for i in tweets returns all of the keys in tweets (the column labels), not row indices. It appears that "id" is the first one returned. When you try to assign tweets['text-filtered']['id'] = filtered_sentence, there just is no such element.
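
A minimal sketch of that behavior (assuming tweets is a pandas DataFrame, which iterates over its column labels much like a dictionary iterates over its keys):

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'text': ['a b', 'c d']})

# iterating a DataFrame yields column labels, not row positions
for i in df:
    print(i)        # prints 'id', then 'text'

# so df['text'][i] becomes df['text']['id'] on the first pass -> KeyError: 'id'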

Try coding more gently: start at the inside, code a few lines at a time, and work your way up to more complex control structures. Debug each addition before you go on. Here, you seem to have lost track of what is a numeric index, what is a list, and what is a dictionary.

Since you haven't done any visible debugging or provided the context, I can't fix the whole program for you, but this should get you started.
