Remove URLs from Twitter text after an API search with Tweepy

I crawl Twitter data using Tweepy and Python. All is well: I have a pandas DataFrame with the text of the tweets. But after almost every tweet there is a shortened URL like: .

I want to remove these from the text. I have this code, and I don't get why it doesn't work:

import re
from nltk.corpus import stopwords

def preprocess2(raw_text):
    stopword_set = set(stopwords.words("english"))
    raw_text = re.sub(r'^https?:\/\/.*[\r\n]*', '', raw_text, flags=re.MULTILINE)
    return " ".join([i for i in re.sub(r'[^a-zA-Z\s]', "", raw_text).lower().split() if i not in stopword_set])

input: "I need sugarbaby I'm going to cater for your needs take care of you pour out your mind to me tell me your worries i… https://dfdf/dfsd "

Expected output:

"I need sugarbaby I'm going to cater for your needs take care of you pour out your mind to me tell me your worries i…"

In your example, the URL doesn't start at the beginning of the line. Therefore, the ^ in your regular expression doesn't match. Removing this single character should do the trick:

raw_text = re.sub(r'https?:\/\/.*[\r\n]*', '', raw_text, flags=re.MULTILINE)
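Note that `.*` in the pattern above removes everything from the URL to the end of the line, which is fine when the URL is the last thing in the tweet. If a URL could appear mid-sentence, a narrower pattern may be preferable. The following is a minimal sketch under that assumption; the helper name `strip_urls` and the `\S+` pattern are illustrative choices, not part of the original code:

```python
import re

def strip_urls(raw_text):
    """Remove http/https URLs wherever they occur (hypothetical helper)."""
    # \S+ stops at the next whitespace, so text following a URL is kept.
    # No ^ anchor: the URL does not have to start the line.
    return re.sub(r'https?://\S+', '', raw_text).strip()

tweet = "pour out your mind to me tell me your worries i… https://dfdf/dfsd "
print(strip_urls(tweet))
```

Calling this before the stopword filtering in `preprocess2` should leave the tweet text intact and drop only the link.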
