I have reviewed various links, but they all showed how to replace multiple words in one pass. Instead of words, I want to replace patterns, e.g.
RT @amrightnow: "The Real Trump" Trump About You" Watch Make #1 https:\\/\\/t.co\\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016 https:\\/\\/t.co\…
When I run the following two commands on the above text, I get the desired output:
result = re.sub(r"http\S+","",sent)
result1 = re.sub(r"@\S+","",result)
This removes all the URLs and @handles from the tweet. The output looks something like this:
>>> result1
'RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Trump #trump2016 '
Could someone let me know the best way to do this in a single pass? I will basically be reading tweets from a file. I want to read each tweet and replace these handles and URLs with blanks.
You need the regex "or" operator, which is the pipe |:
re.sub(r"http\S+|@\S+","",sent)
If you have a long list of patterns that you want to remove, a common trick is to use join to create the regular expression:
to_match = [r'http\S+',
            r'@\S+',
            r'something_else_you_might_want_to_remove']
re.sub('|'.join(to_match), '', sent)
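Since the tweets are being read from a file, it may be worth compiling the joined pattern once and reusing it for every tweet. The following is only a minimal sketch of that idea: the names PATTERNS, CLEANER and clean_tweet are hypothetical, the sample string is made up, and each pattern is wrapped in a non-capturing group on the assumption that this keeps the patterns independent of one another once they are joined with |.

```python
import re

# Hypothetical list of patterns to strip; extend as needed.
PATTERNS = [r"http\S+", r"@\S+"]

# Wrap each pattern in a non-capturing group, then join with "|"
# and compile once so the same object is reused for every tweet.
CLEANER = re.compile("|".join("(?:%s)" % p for p in PATTERNS))


def clean_tweet(text):
    """Remove every match of any pattern in a single pass."""
    return CLEANER.sub("", text)


# Made-up sample tweet, for illustration only.
sample = "RT @user: hello http://t.co/abc #tag"
cleaned = clean_tweet(sample)
print(cleaned)
```

To process a file, the same function could be called on each line as it is read. The surrounding whitespace is left untouched, so removed tokens leave double spaces behind.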
You can use an "or" pattern by separating the patterns with |:
import re
s = u'RT @amrightnow: "The Real Trump" Trump About You" Watch Make #1 https:\/\/t.co\/j58e8aacrE #tcot #pjnet #1A #2A #Tru mp #trump2016 https:\/\/t.co\u2026'
result = re.sub(r"http\S+|@\S+", "", s)
print(result)
Output
RT "The Real Trump" Trump About You" Watch Make #1 #tcot #pjnet #1A #2A #Tru mp #trump2016
See the subsection '|' in the regular expression syntax documentation.
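One detail from that subsection matters when several patterns are joined: alternatives are tried from left to right, and the first one that matches wins, even if a later one would match more text. The short sketch below illustrates this with two made-up patterns, a and ab, which are not part of the original question.

```python
import re

# "a" is listed first, so it wins even though "ab" would match more text.
first = re.search(r"a|ab", "ab").group()

# Listing the longer alternative first lets it match instead.
second = re.search(r"ab|a", "ab").group()

print(first)
print(second)
```

For the URL and @handle patterns in this thread the order makes no practical difference, because the two patterns start with different characters.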