
Regular Expression for URL in Python

I want to delete all the URLs in a sentence.
Here is my code:

import re
import ijson
f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article) # question here
    for r in ret:
        article = article.replace(r, "")
    print(article)

But URLs starting with "http" are still left in the sentence.

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

Any ideas? Thanks for your help.

One simple fix would be to replace every match of the pattern https?://\S+ with an empty string:

import re
article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that every non-whitespace character following http:// or https:// is part of the URL.
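
As a rough sketch of how that substitution might be applied to several articles, the snippet below removes the URLs and then collapses the doubled spaces they leave behind. The names URL_RE and strip_urls and the sample list are hypothetical, not part of the question's code.

```python
import re

# Compile once so the pattern can be reused for every article.
URL_RE = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove http(s) URLs, then squeeze runs of spaces into one (hypothetical helper)."""
    without_urls = URL_RE.sub('', text)
    return re.sub(r' {2,}', ' ', without_urls).strip()

# Sample data for illustration only.
articles = [
    "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍",
    "no links here",
]
for article in articles:
    print(strip_urls(article))
```

Because \S+ is greedy, any punctuation glued to the end of a URL is removed along with it, which may or may not be what you want.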

The URL starts with http, but your pattern then uses [s*], a character class that matches exactly one character, either s or *. Because that character is required, a URL starting with http:// can never match.
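
To see the difference, here is a small comparison of the two patterns on the example sentence. The variable names are only for illustration.

```python
import re

text = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

broken = r"http[s*]:[a-zA-Z0-9_.+-/#~]+ "  # [s*] demands one 's' or '*'
fixed = r"https?:[a-zA-Z0-9_.+-/#~]+ "     # s? makes the 's' optional

broken_matches = re.findall(broken, text)
fixed_matches = re.findall(fixed, text)
print(broken_matches)  # []
print(fixed_matches)   # ['http://i.imgur.com/uxvRo3h.jpg ']

# The character class does match a literal 's' or '*', but never "nothing".
star_matches = re.findall(r"http[s*]:", "https: http*: http:")
print(star_matches)    # ['https:', 'http*:']
```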

I think you are looking for

https?:[a-zA-Z0-9_.+-/#~]+ 


import re
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)

Result

眼影盤長這樣 說真的 很不好拍

A shorter and somewhat broader alternative is to match one or more non-whitespace characters with \S+, followed by zero or more spaces to consume the trailing space, as your original pattern does.

\bhttps?:\S+ *
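
A minimal sketch of how the broader pattern behaves compared with the stricter one, using a made-up sentence in which the URL is the last thing in the string:

```python
import re

strict = r"https?:[a-zA-Z0-9_.+-/#~]+ "  # requires a trailing space
broad = r"\bhttps?:\S+ *"                # trailing spaces are optional

# Hypothetical sample: the URL sits at the very end, with no space after it.
text = "圖在這 https://i.imgur.com/uxvRo3h.jpg"

strict_result = re.sub(strict, "", text)
broad_result = re.sub(broad, "", text)
print(repr(strict_result))  # unchanged, because no trailing space follows the URL
print(repr(broad_result))   # '圖在這 '

# Caveat: \b needs a word boundary before "http". In Python 3 a CJK
# character counts as a word character, so a URL glued directly to one
# is not matched by the broad pattern.
glued = "樣http://a.b"
glued_result = re.sub(broad, "", glued)
print(repr(glued_result))   # '樣http://a.b'
```

If the text may contain URLs that directly follow Chinese characters without a space, dropping the \b is probably the safer choice.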


Change the [s*] to s?. The former is a character class that matches exactly one of two characters, s or *. The latter makes the s optional. Websites like regex101.com let you experiment with regular expressions in the Python flavor and explain how each part of the regex is interpreted.
