
Removing URL features from tokens in NLTK

I'm building a little 'trending' algorithm. The tokeniser works as originally intended, bar a couple of hiccups around URLs, which are causing some problems.

Obviously, as I'm pulling info from Twitter, there are a lot of t.co URL-shortener links. I'd like to remove these as they are not 'words', preferably at the tokeniser stage, but I am currently filtering them out after the fact. I don't think I can run the tokens against a whitelist of recognisable English words because, again, this is Twitter, with contractions and so on.

My code that wraps around the function that pulls the top 10 most common words in any given period is:

tweets = Tweet.objects.filter(lang='en', created_at__gte=start, created_at__lte=end)
number_of_tweets = tweets.count()
most_popular = trending.run_all(start, end, "word").keys()[:10]
print "BEFORE", most_popular
for i, thing in enumerate(most_popular):
    try:
        if "/" in thing:
            most_popular.remove(thing)
            print i, thing, "Removed it."
    except UnicodeEncodeError, e:
        print "Unicode error", e
        most_popular.remove(thing)
print "NOW", most_popular

That try/except block should, in theory, remove any token containing URL fragments from the list. It doesn't, though: I'm always left with a couple.

Running trending.run_all on a time period gives, for example:

[u'//t.co/r6gkL104ai/nKate', u'EXPLAIN', u'\\U0001f62b\\U0001f62d/nRT', u'woods', u'hanging', u'ndtv/nRT', u'BenDohertyCorro', u'\न\ि\र\्\द\ो\ष_\ब\ा\प\ू\…/nPolice', u'LAST', u'health/nTime']

Running the rest of the code imported into python commandline gives:

0 //t.co/r6gkL104ai/nKate Removed it
1 😫😭/nRT Removed it
2 hanging 
3 ndtv/nRT Removed it
4 निर्दोष_बापू…/nPolice Removed it
5 health/nTime Removed it
6 Western 
7 //t.co/4dhGoBpzR0 Removed it
8 //t.co/TkHhI7n…/nRT Removed it
9 //t.co/WmWkcG1dOz/nRT Removed it
10 bringing 
 ...
32 kids

NOW [u'EXPLAIN', u'woods', u'hanging', u'BenDohertyCorro', u'LAST', u'scolo', u'Western', u'//t.co/jB0TWYAJSI/nMe', u'BREAKINGNEWS', u'//t.co/9gYG8y5OKK', u'bringing', u'Valls', u'advices', u'Signatures', u'//t.co/vmQfyenXp4/nJury', u'strengthandcondition\u2026', u'HAPPENED', u'\u2705', u'\U0001f60f', u'//t.co/5JR8RXsJ87/nIs', u'Hamilton', u'Logging', u'Happening', u'Foundation', u'//t.co/gC959Q43QD/nRT', u'ISIS=CIA', u'Footnotes', u'ARYNEWSOFFICIAL', u'LoveMyLife', u'-they', u'B\xf6rse', u'InfoTerrorism', u'kids']

So for some reason, that little hunk isn't (consistently) cutting them out, or isn't acting as expected. This causes a particular problem with reverse lookups in Django, as I intend to use the top X phrases in a period as clickable links - obviously that breaks the lookup completely, and there (rightly) doesn't seem to be a way to Except out of that in the template, so I'd rather take care of this in the views.

It seems to me that the issue is that you are removing items from a list while iterating over it. The solution is simple: iterate over a copy of the list instead:

for i, thing in enumerate(most_popular[:]):

Notice the '[:]', which creates a (shallow) copy of the list. The loop then walks the copy while remove() modifies the original.
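As a minimal sketch of that fix, here is the loop run on a few tokens copied from the question's output. The "/" test is the asker's own filter; the variable names `tokens` and `cleaned` are only illustrative, and the code is written with print() so it runs on Python 3 (the question itself uses Python 2 syntax). A list comprehension is shown as an alternative that avoids removal altogether.

```python
# Sample tokens taken from the question's output.
most_popular = [u'//t.co/r6gkL104ai/nKate', u'EXPLAIN', u'ndtv/nRT',
                u'health/nTime', u'woods']

# Option 1: iterate over a copy ([:]) and remove from the original list.
for thing in most_popular[:]:
    if "/" in thing:
        most_popular.remove(thing)

print(most_popular)

# Option 2: build a new list instead of mutating the old one.
tokens = [u'//t.co/4dhGoBpzR0', u'Western', u'//t.co/WmWkcG1dOz/nRT', u'kids']
cleaned = [t for t in tokens if "/" not in t]

print(cleaned)
```

Either way, every token gets examined exactly once, so no URL fragment can slip through just because of its position in the list.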

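The reason for this behavior can be found in this post. In short, removing an item shifts every later item one position to the left while the loop's internal index still moves forward, so the item immediately after each removed one is never examined.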
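To see the skipping in isolation, here is a small sketch with made-up tokens (the list contents and the `visited` helper are purely illustrative, not from the question). It records which items the loop actually looks at when the list is modified during iteration.

```python
# Made-up tokens: four contain "/", one does not.
tokens = ['a/1', 'b/2', 'keep', 'c/3', 'd/4']
visited = []

# Deliberately buggy: removes from the same list it is iterating over.
for i, thing in enumerate(tokens):
    visited.append(thing)
    if "/" in thing:
        tokens.remove(thing)

print(visited)  # ['a/1', 'keep', 'c/3']
print(tokens)   # ['b/2', 'keep', 'd/4']
```

'b/2' and 'd/4' are never visited, because each slid into the position the loop had just finished with. That matches the pattern in the question, where some URL tokens survive.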

