
Python regex for extracting hashtags from text

I'm processing some tweets I mined during the election, and I need a way to extract hashtags from the tweet text while accounting for punctuation, non-ASCII characters, etc., and still retaining the hashtags in the output list.

For example, the original text of a tweet looks like:

I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn

When this is turned into a string in Python (or even put into a code block on this site), the special characters near the end are changed, producing this:

"I'm with HER! #NeverTrump #DumpTrump #imwithherdY\xd8\xa7dY\xd8, @ Williamsburg, Brooklyn"

Now I would like to parse the string into a list like this:

['#NeverTrump','#DumpTrump', '#imwithher']

I'm currently using this expression, where str is the above string:

tokenizedTweet = re.findall(r'(?i)\#\w+', str, flags=re.UNICODE)

However, I'm getting this as output:

['#NeverTrump', '#DumpTrump', '#imwithherdY\xd8']

How would I account for 'dY\xd8' in my regex to exclude it? I'm also open to other solutions not involving regex.

Yeah, about the solution not involving regex. ;)

# -*- coding: utf-8 -*-
import string

tweets = []  # collects the hashtags found in the text

a = "I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn"

# keep only printable ASCII characters, which drops the emoji
a = ''.join(filter(lambda x: x in string.printable, a))

print(a)

# split on spaces and keep the tokens that start with '#'
for tweet in a.split(' '):
    if tweet.startswith('#'):
        tweets.append(tweet.strip(','))  # trim any surrounding commas

print(tweets)

And ta-da: ['#NeverTrump', '#DumpTrump', '#imwithher']
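
For comparison, here is a minimal regex-based sketch of the same idea. It assumes Python 3 and a correctly decoded tweet (the emoji is still an emoji, not the mangled 'dY\xd8' bytes). The name extract_hashtags and the choice to allow only ASCII letters, digits and underscores in a hashtag are assumptions made for illustration, not something the question specifies.

```python
import re

# Assumption: a hashtag is '#' followed by ASCII letters, digits or underscores.
# An explicit character class avoids \w matching non-ASCII letters such as '\xd8'.
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")


def extract_hashtags(text):
    """Return every '#'-prefixed run of ASCII word characters found in text."""
    return HASHTAG_RE.findall(text)


# The flag emoji is written with escapes so the example does not depend on file encoding.
tweet = ("I'm with HER! #NeverTrump #DumpTrump "
         "#imwithher\U0001F1FA\U0001F1F8 @ Williamsburg, Brooklyn")

hashtags = extract_hashtags(tweet)
print(hashtags)
```

Note that if the text has already been mangled into 'dY\xd8...', the 'dY' part is made of ordinary ASCII letters and would still be captured, so decoding the tweet correctly before matching probably matters more than the pattern itself.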

