Removing common junk from tweets for topic modeling

I am trying to remove common junk from tweets: RT, every string starting with @, and all URLs. This is how I approached it:

prefixes = ["http", "ftp", "@", "#", "RT"]

for prefix in prefixes:
    for word in final_tweet:
        if word.startswith(prefix):
            print "starts with prefix"
            word = ''

This code always detects the junk, but it only sometimes removes it. What could the problem be?

Here are some examples of the output:

['RT', '@NadelParis:', 'Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https:/\xe2\x80\xa6']
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
['Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https://t.co/I40s8x3QAV']

['RT', '@dbrandSkins:', 'Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
starts with prefix
starts with prefix
['Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
['RT', '@WeLoveRobDyrdek:', 'This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
starts with prefix
starts with prefix
['This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
['RT', '@sayingsforgirls:', 'Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
starts with prefix
starts with prefix
['Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
['RT', '@BrianaaSymonee:', 'says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']
starts with prefix
starts with prefix
['says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']

You can filter the list once for each prefix:

>>> for prefix in prefixes:
...     final_tweet = [ w for w in final_tweet if not w.startswith(prefix)]
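
For context on why the loop as posted in the question leaves the list unchanged: assigning `word = ''` only rebinds the loop variable and never writes back into `final_tweet`. The following is a minimal sketch of that difference, not the asker's actual program. The list `tweet_words` and its contents are made-up sample data, and the names `unchanged` and `filtered` are introduced here for illustration only.

```python
prefixes = ["http", "ftp", "@", "#", "RT"]
# Hypothetical sample data, not taken from the original post.
tweet_words = ["RT", "@user:", "This", "dog", "https://t.co/abc"]

# Rebinding the loop variable does not touch the list itself.
for word in tweet_words:
    if word.startswith("@"):
        word = ""
unchanged = list(tweet_words)  # still contains "@user:"

# Building a new list for each prefix actually drops the matching words.
filtered = list(tweet_words)
for prefix in prefixes:
    filtered = [w for w in filtered if not w.startswith(prefix)]
```

After this runs, `unchanged` still holds all five words, while `filtered` keeps only the words that match none of the prefixes.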

An answer given by someone on the #python IRC channel:

final_tweet = [word for word in final_tweet if not any(word.startswith(prefix) for prefix in prefixes)]
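
A possible variation on the same idea: `str.startswith` also accepts a tuple of prefixes, so the `any(...)` generator can be dropped. This is only a sketch. The helper name `strip_junk` is hypothetical, and the sample list is one of the tweets from the question's output.

```python
# str.startswith takes a tuple of prefixes; passing a list raises TypeError.
PREFIXES = ("http", "ftp", "@", "#", "RT")

def strip_junk(words, prefixes=PREFIXES):
    """Return the words that do not start with any of the given prefixes."""
    return [w for w in words if not w.startswith(prefixes)]

sample = ["RT", "@WeLoveRobDyrdek:", "This", "dog", "", "https://t.co/5N86jYipOI"]
cleaned = strip_junk(sample)
```

Two caveats follow from how prefix matching works. The empty string in the sample is kept, because it starts with none of the prefixes. And the "RT" prefix also removes ordinary words that happen to begin with those letters, such as 'RTPlz<3' in the first example, so an equality test (`w == "RT"`) may be a better fit for the retweet marker.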
