Extract URLs with a specific TLD from lines using regex
Hi everyone, I'm trying to extract URLs from a file, but only those with a specific ending such as ".eu" (the same way one might filter for ".com").
I have this code to get a list of URLs, but it does not filter by a specific ending. Can anyone improve it so that it only matches a specific TLD at the end?
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
Example lines and expected results:
akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76
a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76
fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76
Expected results:
You may use
re.findall(r'https?://\S*?\.eu\b', line)
See the regex demo.
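As a rough, runnable version of that demo, the sketch below applies the pattern to the three sample lines from the question. The names `lines`, `EU_URL` and `urls` are only illustrative, and the sample lines are hard-coded here instead of being read from a file.

```python
import re

# Sample lines copied from the question (hard-coded for illustration).
lines = [
    "akijsdijas adsfaasd asfda https://www.google.eu/asd34a/as3df asdfs dsf76",
    "a56 64ijas adsfaasd asfda https://www.facebook.eu/asd34a/as3df asdfs345 dsf76",
    "fghddijas adsfaasd asfda https://www.facebook.com/asd34a/as3df asdfs dsf76",
]

# Scheme, then as few non-whitespace chars as possible, then ".eu" at a word boundary.
EU_URL = re.compile(r'https?://\S*?\.eu\b')

urls = []
for line in lines:
    # findall returns every non-overlapping match in the line (possibly none).
    urls.extend(EU_URL.findall(line))

print(urls)  # ['https://www.google.eu', 'https://www.facebook.eu']
```

Note that each match ends at ".eu", so the path after the domain is not part of the result, and the ".com" line yields nothing.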
The regex matches:
https?:// - http:// or https://
\S*? - any 0+ non-whitespace chars, as few as possible
\.eu\b - a .eu followed with a non-word char or end of string.

Try this:
urls = re.findall(r'https?://\S*\.eu\b', line)
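The two suggested patterns differ only in `\S*?` (lazy) versus `\S*` (greedy). On the question's sample lines they should give the same results, since each URL contains ".eu" once. The sketch below uses a made-up line (not from the question) in which ".eu" appears twice inside one URL, to show where they diverge.

```python
import re

lazy = re.compile(r'https?://\S*?\.eu\b')   # stops at the first ".eu"
greedy = re.compile(r'https?://\S*\.eu\b')  # runs on to the last ".eu"

# Hypothetical input: the path itself contains another ".eu" host name.
line = "see https://www.google.eu/mirror/www.facebook.eu/page now"

lazy_hits = lazy.findall(line)
greedy_hits = greedy.findall(line)

print(lazy_hits)    # ['https://www.google.eu']
print(greedy_hits)  # ['https://www.google.eu/mirror/www.facebook.eu']
```

Both patterns look for ".eu" anywhere in the run of non-whitespace characters, not only in the host part, so a URL such as one ending in a ".eu" file name would also match.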