Removing common junk from tweets for topic modeling
I am trying to remove common junk such as RT, all strings starting with @, and all URLs. This is how I approached it:
prefixes=["http","ftp","@","#","RT"]
for prefix in prefixes:
    for word in final_tweet:
        if word.startswith(prefix):
            print "starts with prefix"
            word=''
While this code always detects the junk, it only sometimes removes it. So I wonder what the problem could be?
Here are some examples of the output:
['RT', '@NadelParis:', 'Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https:/\xe2\x80\xa6']
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
starts with prefix
['Going2LOVEorKILL?Download', 'NOW!', 'https://t.co/xilNh66e34', '@CrookedIntriago', '@Seven13music', '@UMG', '\xe3\x82\x8f\xe3\x81\x9f\xe3\x81\x97\xe3\x81\xaf\xe3\x80\x81\xe3\x81\x82\xe3\x81\xaa\xe3\x81\x9f\xe3\x82\x92\xe6\x84\x9b\xe3\x81\x97\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99!', 'RTPlz<3', 'https://t.co/I40s8x3QAV']
['RT', '@dbrandSkins:', 'Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
starts with prefix
starts with prefix
['Dear', 'Apple,', 'T9', 'dialing', 'optional.', 'Get', 'shit', 'together.', 'Signed,\nEveryone']
['RT', '@WeLoveRobDyrdek:', 'This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
starts with prefix
starts with prefix
['This', 'dog', '', 'https://t.co/5N86jYipOI']
null found
starts with prefix
['RT', '@sayingsforgirls:', 'Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
starts with prefix
starts with prefix
['Do', 'touch', 'MY', 'iPhone.', "It's", 'usPhone,', 'wePhone,', 'ourPhone,']
['RT', '@BrianaaSymonee:', 'says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']
starts with prefix
starts with prefix
['says', 'imma', 'dog,', 'takes', 'one', 'know', 'one...']
You can check for each prefix:
>>> for prefix in prefixes:
...     final_tweet = [w for w in final_tweet if not w.startswith(prefix)]
An answer given by someone in the #Python IRC channel:
final_tweet = [word for word in final_tweet if not any(word.startswith(prefix) for prefix in prefixes)]
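For context on why the loop in the question cannot remove anything by itself: assigning word='' only rebinds the loop variable and leaves the list untouched, so a new list has to be built instead. Below is a minimal sketch of the same filtering idea wrapped in a function. The name strip_junk and the extra step of dropping empty tokens are assumptions for illustration, not part of the answers above. It relies on str.startswith accepting a tuple of prefixes.

```python
# Prefixes taken from the question; a tuple lets str.startswith test all of them in one call.
PREFIXES = ("http", "ftp", "@", "#", "RT")

def strip_junk(words, prefixes=PREFIXES):
    """Return a new list without empty tokens or tokens starting with any prefix.

    strip_junk is a hypothetical helper name; dropping empty tokens is an
    assumption added here, the original answers only filter by prefix.
    """
    return [w for w in words if w and not w.startswith(prefixes)]

# One of the tokenized tweets shown in the question's output.
tweet = ['RT', '@WeLoveRobDyrdek:', 'This', 'dog', '', 'https://t.co/5N86jYipOI']
cleaned = strip_junk(tweet)
print(cleaned)  # ['This', 'dog']
```

Note that the "RT" prefix also matches ordinary tokens that happen to begin with those letters, such as 'RTPlz<3' in the first example, so an exact comparison (word == "RT") may be preferable for that one entry.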