
Python regex removes URLs, but when the URL is at the beginning of a tweet the whole sentence is removed as well

I need to remove every URL from a set of tweets. How can I remove only the URL when it appears at the beginning of a tweet?

I've tried some code, and this Python regex does remove the URL, but when the URL is at the beginning of the tweet, the rest of the sentence is removed as well.

re.sub(r'https?:\/\/.*[\r\n]*\S+', '', verbatim, flags = re.MULTILINE)

That is, a tweet that starts with a URL ends up completely empty.

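The pattern https?:\/\/.*[\r\n]*\S+ first matches http, an optional s, and ://.

The .* part then matches greedily to the end of the line (the dot does not match a newline by default), [\r\n]* matches zero or more newline characters, and \S+ matches one or more non-whitespace characters.

So the match covers the URL, the rest of the line after it, any newlines, and the first run of non-whitespace characters on the next line as well. If there is no following line, .* gives back one character so that \S+ can still match, and the rest of the line is removed all the same.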
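The behaviour described above can be reproduced with a short sketch. The sample tweets and variable names here are made up for illustration; only the pattern comes from the question.

```python
import re

# Pattern from the question, unchanged.
GREEDY = r'https?:\/\/.*[\r\n]*\S+'

# Hypothetical sample tweets (not from the original question).
url_first = "https://example.com/a great read about regex"
url_last = "great read about regex https://example.com/a"
two_lines = "see https://example.com/a\nsecond line"

greedy_first = re.sub(GREEDY, '', url_first, flags=re.MULTILINE)
greedy_last = re.sub(GREEDY, '', url_last, flags=re.MULTILINE)
greedy_two_lines = re.sub(GREEDY, '', two_lines, flags=re.MULTILINE)

print(repr(greedy_first))      # URL first: the whole tweet is consumed
print(repr(greedy_last))       # URL last: only the URL is removed
print(repr(greedy_two_lines))  # the first word of the next line goes too
```

The pattern only looks correct when the URL happens to be the last thing in the tweet.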

You could shorten the pattern to:

\bhttps?://\S+

Regex demo
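As a minimal sketch of how the shortened pattern could be used: strip_urls is a hypothetical helper name, and collapsing the leftover whitespace (which also flattens newlines) is an extra assumption, not something the question asked for.

```python
import re

# Shortened pattern: a URL ends at the first whitespace character.
URL = re.compile(r'\bhttps?://\S+')

def strip_urls(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL.sub('', text)
    # split() with no arguments drops leading/trailing and repeated whitespace.
    return ' '.join(without_urls.split())

# Hypothetical sample tweets.
print(strip_urls("https://example.com/a great read about regex"))
print(strip_urls("great read https://t.co/xyz today"))
```

Because \S+ stops at the first whitespace, the text after a leading URL is kept.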

Try making the regex lazy by adding ? after .* and matching up to the next whitespace character.

I also adjusted the escaping: in a raw string, \r, \n and \s take a single backslash.

re.sub(r'https?://.*?[\r\n]*[\s?]', '', verbatim, flags=re.MULTILINE)

regex101 link to live demo
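A quick check of the lazy variant, assuming the pattern reads https?://.*?[\r\n]*[\s?] (the quantifiers are a reconstruction, and the sample tweets are hypothetical). Note that [\s?] is a character class, so it matches either a whitespace character or a literal question mark.

```python
import re

# Assumed reading of the lazy pattern from this answer.
LAZY = r'https?://.*?[\r\n]*[\s?]'

# Hypothetical sample tweets.
lazy_first = re.sub(LAZY, '', "https://example.com/a great read",
                    flags=re.MULTILINE)
lazy_last = re.sub(LAZY, '', "great read https://example.com/a",
                   flags=re.MULTILINE)
lazy_query = re.sub(LAZY, '', "https://example.com/a?id=1 great",
                    flags=re.MULTILINE)

print(repr(lazy_first))  # leading URL and the space after it are removed
print(repr(lazy_last))   # no trailing whitespace, so nothing matches
print(repr(lazy_query))  # the match stops at the literal '?'
```

Under that reading, the lazy version handles a leading URL, but it leaves a URL at the very end of the text in place and cuts a URL with a query string at the question mark.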

