
Extract URLs from lines with a specific TLD using regex

Hi everyone, I'm trying to extract URLs from a file, but only those with a specific ending such as ".eu" (a TLD, like ".com").

I have this code, which gets a list of URLs but does not filter by a specific ending. Can anyone improve it so that it only matches URLs with a specific TLD at the end?

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)

Example lines and expected results:

akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76

a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76

fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76

Expected results:

https://www.google.eu

https://www.facebook.eu

You may use

re.findall(r'https?://\S*?\.eu\b', line)

See the regex demo.

The regex matches:

  • https?:// - http:// or https://
  • \S*? - any 0+ non-whitespace chars, as few as possible
  • \.eu\b - a literal .eu followed by a non-word char or the end of the string.
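
As a quick check, here is a minimal sketch that runs the pattern above over the three sample lines from the question. The names `lines`, `EU_URL` and `urls` are only illustrative and do not appear in the original post.

```python
import re

# Sample lines copied from the question.
lines = [
    "akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76",
    "a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76",
    "fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76",
]

# Pattern from the answer, compiled once because it is reused for every line.
EU_URL = re.compile(r'https?://\S*?\.eu\b')

urls = []
for line in lines:
    # findall returns every non-overlapping match in the line (possibly none).
    urls.extend(EU_URL.findall(line))

print(urls)  # ['https://www.google.eu', 'https://www.facebook.eu']
```

The .com line contributes nothing because no ".eu" occurs anywhere in its URL. Note that the pattern does not know where the host name ends, so a ".eu" that first appears in the path of a non-.eu URL would also be matched.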

Try this:

urls = re.findall(r'https?://\S*\.eu\b', line)
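
This pattern uses a greedy \S* where the earlier answer uses the lazy \S*?. On the sample lines both give the same result, but they can differ when ".eu" occurs more than once in the same URL. The sketch below shows the difference on a made-up line; the URL in it is hypothetical and not taken from the question.

```python
import re

# Hypothetical line (an assumption for illustration): the path also ends in ".eu".
line = "see https://www.google.eu/files/report.eu for details"

# Lazy quantifier: stops at the first ".eu" that is followed by a word boundary.
lazy = re.findall(r'https?://\S*?\.eu\b', line)

# Greedy quantifier: runs to the last ".eu" in the same non-whitespace run.
greedy = re.findall(r'https?://\S*\.eu\b', line)

print(lazy)    # ['https://www.google.eu']
print(greedy)  # ['https://www.google.eu/files/report.eu']
```

If only the scheme and host are wanted, as in the expected results of the question, the lazy form is probably the safer choice.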

