
Using Python regex to filter tweets

I'm trying to create a query that filters tweets by their @ or # tags.

So I only want results for @Obama or #Obama, but not for plain Obama. This is what I have so far:

re.compile(r'\b(?:#|@|)*%s*\b' % re.escape(obama), re.IGNORECASE)

Thanks for the replies. I tried both answers, and what seems to work in my situation is:

 re.compile(r'\b[#@]*%s\b' % re.escape(term), re.IGNORECASE)  

'term' is an element of a list that I iterate over. This returns tweets that have either a # or an @ prepended to the 'term'. I tried not using '*', but it raised exceptions.
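The loop described above might look roughly like the following sketch. The `terms` and `tweets` lists and the `filter_tweets` helper are hypothetical, since the question does not show them; only the pattern is taken from the post.

```python
import re

# Hypothetical sample data; the question does not show its terms or tweets.
terms = ["Obama", "NASA"]
tweets = [
    "Watching @Obama tonight",
    "#nasa launch at noon",
    "Obama on the news",
    "Nothing relevant here",
]


def filter_tweets(tweets, terms):
    """Return (term, tweet) pairs where the pattern from the post finds the term."""
    hits = []
    for term in terms:
        # Pattern copied from the post: optional run of # or @, then the escaped term.
        pattern = re.compile(r'\b[#@]*%s\b' % re.escape(term), re.IGNORECASE)
        for tweet in tweets:
            if pattern.search(tweet):
                hits.append((term, tweet))
    return hits


for term, tweet in filter_tweets(tweets, terms):
    print(term, "->", tweet)
```

With this sample data the untagged tweet "Obama on the news" is returned as well, because `[#@]*` is allowed to match zero characters. It is worth checking whether that is acceptable for the data being filtered.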

Thanks

Try using this regular expression:

r'\b[#@]{name}\b'.format(name=re.escape('Obama'))

The character class [#@] is faster than the alternation group (?:#|@).

So the pattern begins with a word boundary \b, followed by # or @. Next comes the escaped search term (the obama variable in the question), and finally the trailing word boundary.

In the question you used * quantifiers, which repeat the preceding expression zero or more times. There is no reason to repeat the # and @ symbols. Also, the last character of obama shouldn't be repeated either.
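To see how the quantifier and the leading \b interact, here is a small comparison sketch. The sample strings and the `matches` helper are made up for illustration, and the third pattern (a `(?<!\w)` lookbehind in place of the leading \b) is an assumed alternative that does not appear in the thread. The point to keep in mind is that # and @ are not word characters, so \b directly before them only succeeds when the preceding character is a letter, digit, or underscore.

```python
import re

term = "Obama"
escaped = re.escape(term)

patterns = {
    # From the question's update: [#@]* may match zero characters.
    "star": re.compile(r'\b[#@]*%s\b' % escaped, re.IGNORECASE),
    # From this answer: exactly one # or @, preceded by a word boundary.
    "strict": re.compile(r'\b[#@]%s\b' % escaped, re.IGNORECASE),
    # Assumed alternative (not from the thread): the tag must not follow a word character.
    "lookbehind": re.compile(r'(?<!\w)[#@]%s\b' % escaped, re.IGNORECASE),
}

# Made-up sample strings.
samples = ["Obama speaks", "#Obama speaks", "hi @obama", "vote#Obama"]


def matches(name, text):
    """Return True if the named pattern is found anywhere in text."""
    return patterns[name].search(text) is not None


for text in samples:
    print(text, {name: matches(name, text) for name in patterns})
```

On these samples the "star" pattern matches all four strings, including the untagged one. The "strict" pattern matches only "vote#Obama", where the tag directly follows a word character. The lookbehind variant matches "#Obama speaks" and "hi @obama" but not the other two. Which behaviour is wanted depends on the tweets being filtered.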

If this is purely about regexes and has nothing to do with Twitter per se (aside from the fact that you're filtering tweets), then the regex you want is this:

compiled = re.compile(r'\b[#@]obama\b', re.IGNORECASE)

If you want an example of code that does something similar to what you're doing, take a look at this; it might be a worthwhile example:

https://github.com/kgaughan/is-on-a-train/blob/master/isonatrain.py

That code tracks a bunch of users, looks for certain trigger phrases, and writes out an HTML file based on what they say.
