
Python regex to get all the words in a tweet that are not @mention or #hashtag

I want to get the words of a tweet that are not a mention (starting with @) or a hashtag (starting with #).

My code is:

import re
pattern=r'(?u)\b\w\w+\b'
pattern=re.compile(pattern)
pattern.findall('this is a tweet #hashtag @mention')

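The result with this regex is ['this', 'is', 'tweet', 'hashtag', 'mention'] (the one-letter word a is also dropped, because \w\w+ requires at least two characters).

But I don't want the hashtag and mention in the result. I want the result to be: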

this is a tweet

Note that I can't use whitespace instead of \b, because the output for .this is a tweet (note the . at the beginning) should also be [this, is, a, tweet]. \b lets a word start after any non-word character, but if I use \s then this won't be in the results.

(?<![#@])\b\w+\b

You can use this. The negative lookbehind (?<![#@]) rejects any word that is immediately preceded by # or @. See the demo:

https://regex101.com/r/KzHvuy/2
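
As a minimal sketch of how the pattern above could be used from Python (the names WORD_RE and plain_words are hypothetical, chosen only for this illustration):

```python
import re

# Negative lookbehind: a word may not start directly after '#' or '@'.
# \w+ (rather than \w\w+) also keeps one-letter words such as "a".
WORD_RE = re.compile(r'(?<![#@])\b\w+\b')

def plain_words(tweet):
    """Return the words of `tweet` that are not part of a #hashtag or @mention."""
    return WORD_RE.findall(tweet)

print(plain_words('this is a tweet #hashtag @mention'))  # ['this', 'is', 'a', 'tweet']
print(plain_words('.this is a tweet'))                   # ['this', 'is', 'a', 'tweet']
```

Note that the lookbehind only inspects the single character before the word, so in a token like foo#bar the part before the # would still be returned.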

If you are open to solutions other than regex, you can use filter with a lambda function to get the desired result.

a = 'this is a tweet #hashtag @mention'
" ".join(filter(lambda x:x[0]!='#' and x[0]!='@' , a.split()))

'this is a tweet'
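
The same idea can be sketched with str.startswith and a generator expression (drop_tags is a hypothetical name used only here). Unlike the regex approach, this works on whitespace-separated tokens, so punctuation stays attached and a leading . is not stripped:

```python
def drop_tags(tweet):
    """Keep whitespace-separated tokens that do not start with '#' or '@'."""
    return ' '.join(
        tok for tok in tweet.split()
        if not tok.startswith(('#', '@'))  # startswith accepts a tuple of prefixes
    )

print(drop_tags('this is a tweet #hashtag @mention'))  # this is a tweet
print(drop_tags('.this is a tweet'))                   # .this is a tweet
```

If the .this case from the question matters, the regex answer is the better fit.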


 