Hi everyone, I'm trying to extract URLs from a file, but only those with a specific ending (a TLD like ".com"), in my case ".eu".
I have this code to get a list of URLs but not with a specific ending. Can anyone improve it to get a specific TLD at the end?
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
Example lines and expected results:
akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76
a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76
fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76
Expected results:
You may use
re.findall(r'https?://\S*?\.eu\b', line)
See the regex demo.
The regex matches:
https?:// - http:// or https://
\S*? - any 0+ non-whitespace chars, as few as possible
\.eu\b - a .eu followed by a non-word char or end of string.

Try this:
urls = re.findall(r'https?://\S*\.eu\b', line)
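
To see how the two suggested patterns behave, here is a minimal sketch that runs them over the sample lines from the question. The names extract_eu_urls, LAZY and GREEDY are made up for illustration, and the sample lines are kept in a list rather than read from a file.

```python
import re

# Sample lines copied from the question.
lines = [
    "akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76",
    "a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76",
    "fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76",
]

# The two patterns suggested above (names are hypothetical).
LAZY = re.compile(r'https?://\S*?\.eu\b')    # stops at the first ".eu"
GREEDY = re.compile(r'https?://\S*\.eu\b')   # runs to the last ".eu" in the URL


def extract_eu_urls(text_lines, pattern=LAZY):
    """Collect every match of `pattern` from an iterable of lines."""
    urls = []
    for line in text_lines:
        urls.extend(pattern.findall(line))
    return urls


print(extract_eu_urls(lines))
# ['https://www.google.eu', 'https://www.facebook.eu']
```

On these lines both patterns give the same result, and the path after ".eu" is not part of the match. They differ only when ".eu" occurs more than once in the same URL: for "https://a.eu/b.eu" the lazy version returns "https://a.eu" and the greedy one returns the whole string. Note also that neither pattern checks that ".eu" is really the TLD, so a URL such as "https://example.com/page.eu" would match as well.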