Removing common junk from tweets for topic modeling

Question

I am trying to remove common junk such as RT, all strings starting with @, and all URLs. This is how I approached it:

prefixes = ["http", "ftp", "@", "#", "RT"]

for prefix in prefixes:
    for word in final_tweet:
        if word.startswith(prefix):
            print("starts with prefix")
            word = ''

This code always detects the junk, but it only sometimes removes it. What could the problem be?

Here are some examples of the output:

['RT', '@NadelParis:', 'Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https:/\xe2\x80\xa6']
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
['Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https://t.co/I40s8x3QAV']

['RT', '@dbrandSkins:', 'Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
starts with prefix
starts with prefix
['Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
['RT', '@WeLoveRobDyrdek:', 'This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
starts with prefix
starts with prefix
['This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
['RT', '@sayingsforgirls:', 'Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
starts with prefix
starts with prefix
['Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
['RT', '@BrianaaSymonee:', 'says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']
starts with prefix
starts with prefix
['says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']

Answer 1

You can filter the list once for each prefix:

>>> for prefix in prefixes:
...     final_tweet = [w for w in final_tweet if not w.startswith(prefix)]
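For context on why the loop in the question leaves the list untouched: assigning `word = ''` only rebinds the loop variable and never writes anything back into `final_tweet`. The following is a minimal sketch of that difference, not the asker's actual script; the `sample` list is made-up data and the names `unchanged` and `filtered` are hypothetical.

```python
prefixes = ["http", "ftp", "@", "#", "RT"]
# Hypothetical sample tokens, not taken from the original data set.
sample = ["RT", "@user:", "Dear", "Apple,", "https://t.co/abc"]

# Rebinding the loop variable does not modify the list it came from.
unchanged = list(sample)
for prefix in prefixes:
    for word in unchanged:
        if word.startswith(prefix):
            word = ""  # only the local name `word` changes

# Building a new list for each prefix actually drops the matching tokens.
filtered = list(sample)
for prefix in prefixes:
    filtered = [w for w in filtered if not w.startswith(prefix)]

print(unchanged)
print(filtered)
```

The first list comes out identical to the input, while the second keeps only the tokens that match none of the prefixes.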

Answer 2

An answer given by someone on the #python IRC channel:

final_tweet = [word for word in final_tweet if not any(word.startswith(prefix) for prefix in prefixes)]
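A possible variation on the same idea: `str.startswith` also accepts a tuple of prefixes, so the `any(...)` generator can be folded into a single call. The sketch below wraps this in a helper named `strip_junk`, which is a hypothetical name chosen for illustration and not part of either answer.

```python
def strip_junk(tokens, prefixes=("http", "ftp", "@", "#", "RT")):
    """Return the tokens that do not start with any of the given prefixes."""
    prefixes = tuple(prefixes)  # startswith accepts a tuple, not a list
    return [t for t in tokens if not t.startswith(prefixes)]


# Tokens from one of the tweets shown in the question.
tweet = ["RT", "@dbrandSkins:", "Dear", "Apple,", "T9", "dialing", "optional."]
cleaned = strip_junk(tweet)
print(cleaned)
```

Note that a prefix test on "RT" also drops any ordinary token that happens to begin with those letters (for example 'RTPlz<3' in the question's output), so an exact comparison may be preferable for that one marker.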

Answer 1
1 2015-10-29 21:13:16

Answer 2
0 Accepted 2015-10-29 22:12:34
