
Perfect regex for extracting url with re.findall()

I have been googling regular expressions for extracting URLs, but each one I found either fails on one example or makes the Python interpreter hang.

The URL was 'http://www.computerworld.ru/articles/Naslednik-Hadoop-uskoryaet-evolyutsiyu-analiza-dannyh'

A regex for URLs in Python, for use with re.findall:

http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
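A minimal sketch of how this pattern could be used with re.findall. The pattern is copied from above; the name URL_PATTERN and the surrounding sample sentence are made up for illustration, and only the URL itself comes from the question.

```python
import re

# Pattern copied verbatim from the answer; a raw string keeps the backslashes intact.
URL_PATTERN = r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical sample text wrapped around the URL from the question.
text = (
    "See http://www.computerworld.ru/articles/"
    "Naslednik-Hadoop-uskoryaet-evolyutsiyu-analiza-dannyh for details"
)

# With no capturing group in the pattern, findall returns the full matches.
urls = re.findall(URL_PATTERN, text)
print(urls)
```

The match stops at the first whitespace character, because a space is not in any of the character classes.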

If you need a capturing group:

(http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)
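For re.findall the two forms should give the same result: when a pattern contains exactly one capturing group, findall returns the text of that group, and here the group wraps the whole pattern. A small sketch to check this, where PLAIN, GROUPED and the example.com / example.org URLs are placeholders chosen for illustration:

```python
import re

# The pattern without a capturing group, copied from the answer.
PLAIN = r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
# The same pattern wrapped in one capturing group.
GROUPED = "(" + PLAIN + ")"

# Hypothetical sample text with two URLs.
text = "Mirrors: https://example.com/a%20b and http://example.org/x?y=1 (both up)"

plain_hits = re.findall(PLAIN, text)      # full matches
grouped_hits = re.findall(GROUPED, text)  # contents of group 1

print(plain_hits)
print(grouped_hits)
```

The capturing form mostly matters with re.search or re.finditer, where the URL can then be read through match.group(1).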


http matches the characters http literally (case sensitive)
[s]? match a single character present in the list
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed
s the literal character s (case sensitive)
: matches the character : literally
\/ matches the character / literally
\/ matches the character / literally
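(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+ Non-capturing group
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Alternative: [a-zA-Z]
[a-zA-Z] match a single character present in the list below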
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
2nd Alternative: [0-9]
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
3rd Alternative: [$-_@.&+]
[$-_@.&+] match a single character present in the list below
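$-_ a single character in the range between $ and _ (ASCII 0x24 to 0x5F, which also covers digits, uppercase letters, and punctuation such as % - / : = ?)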
@.&+ a single character in the list @.&+ literally (case sensitive)
4th Alternative: [!*\(\),]
[!*\(\),] match a single character present in the list below
!* a single character in the list !* literally
\( matches the character ( literally
\) matches the character ) literally
, the literal character ,
5th Alternative: (?:%[0-9a-fA-F][0-9a-fA-F])
(?:%[0-9a-fA-F][0-9a-fA-F]) Non-capturing group
% matches the character % literally
[0-9a-fA-F] match a single character present in the list below
0-9 a single character in the range between 0 and 9
a-f a single character in the range between a and f (case sensitive)
A-F a single character in the range between A and F (case sensitive)
[0-9a-fA-F] match a single character present in the list below
0-9 a single character in the range between 0 and 9
a-f a single character in the range between a and f (case sensitive)
A-F a single character in the range between A and F (case sensitive)
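One practical consequence of the breakdown above is that the range $-_ decides most of which punctuation the pattern accepts. The sketch below lists the characters in that range and shows one side effect: '#' sits just below '$' in ASCII, so a fragment is cut off. The names covered and truncated and the example.com URL are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern copied verbatim from the answer.
URL_PATTERN = r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Every character in the range $-_ (code points 0x24 through 0x5F).
covered = "".join(chr(c) for c in range(ord("$"), ord("_") + 1))
print(covered)

# '/', ':', '?' and '=' fall inside the range, so paths and query strings match.
in_range = all(re.fullmatch(r"[$-_]", ch) for ch in "/:?=")
# '#' and '~' fall outside it and appear in no other alternative either.
out_of_range = not any(re.fullmatch(r"[$-_]", ch) for ch in "#~")

# Hypothetical URL with a fragment: the match ends just before '#'.
truncated = re.findall(URL_PATTERN, "http://example.com/page#section")
print(truncated)  # ['http://example.com/page']
```

If fragments or '~' need to be kept, the character classes would have to be extended; whether that is desirable depends on the input text.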


 