Remove urls from twitter text after api search tweepy

Question

I crawl Twitter data using Tweepy and Python. All is well: I have a pandas DataFrame with the text of the tweets. But almost every tweet ends with a shortened URL.

I want to remove these from the text. I have this code, and I don't get why it doesn't work:

import re
from nltk.corpus import stopwords

def preprocess2(raw_text):
    stopword_set = set(stopwords.words("english"))
    raw_text = re.sub(r'^https?:\/\/.*[\r\n]*', '', raw_text, flags=re.MULTILINE)
    return " ".join([i for i in re.sub(r'[^a-zA-Z\s]', "", raw_text).lower().split() if i not in stopword_set])

Input: "I need sugarbaby I'm going to cater for your needs take care of you pour out your mind to me tell me your worries i… https://dfdf/dfsd "

Expected output:

"I need sugarbaby I'm going to cater for your needs take care of you pour out your mind to me tell me your worries i…"

Answer 1

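In your example, the URL doesn't start at the beginning of a line. The ^ anchor in your regular expression only matches at the start of a line (with re.MULTILINE, the start of each line), so the pattern never matches a URL that follows other text. Removing this single character should do the trick: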

raw_text = re.sub(r'https?:\/\/.*[\r\n]*', '', raw_text, flags=re.MULTILINE)
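Note that .* removes everything from the URL to the end of the line, which is fine when the URL comes last, as in your example. If a URL can appear in the middle of a tweet, a narrower pattern may be safer. Here is a minimal sketch, assuming that a URL is any run of non-whitespace characters after http:// or https://; the names URL_PATTERN and strip_urls are made up for illustration and are not part of your code:

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace.
# URL_PATTERN and strip_urls are hypothetical names for this sketch.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(raw_text):
    """Remove http(s) URLs anywhere in the text and normalize whitespace."""
    without_urls = URL_PATTERN.sub("", raw_text)
    # split() + join collapses the gaps left behind by removed URLs
    return " ".join(without_urls.split())

tweet = "tell me your worries i… https://dfdf/dfsd "
print(strip_urls(tweet))
```

You could call this at the top of preprocess2 in place of the re.sub line, or apply it to a whole column with something like df["text"].apply(strip_urls), where the column name "text" is an assumption about your DataFrame.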
