
How to get just base URLs from a text file in Python?

I have a text file named weburl that contains many URLs. I want to extract only the base URLs using a regex. Contents of weburl:

 wikimapia.org/1649944/Bahawalpur-Railway-Station
 panoramio.com/photo/84118355
 wikimapia.org/1649944/Bahawalpur-Railway-Station
 nativepakistan.com/photos-of-bahawalpur
 defence.pk/threads/pictures-of-pakistan-railways.303027
 nativepakistan.com/photos-of-bahawalpur
 panoramio.com/photo/51311162
 https://hiveminer.com/User/Pakistan Rail Buff

I need this:

 wikimapia.org
 panoramio.com
 wikimapia.org
 nativepakistan.com
 defence.pk
 nativepakistan.com
 panoramio.com
 https://hiveminer.com

How can I do this using a regex?

One solution could be:

^(?:\w+://)?.*?(?::\d+)?(?=/|$)

It matches the beginning of the line ( ^ ), followed by an optional protocol specification such as https:// ( (?:\w+://)? ). Then it lazily matches any characters ( .*? ) up to an optional port specification such as :80 ( (?::\d+)? ). Finally, it checks that the match is followed by a / or the end of the line (the positive lookahead (?=/|$) ).

You can try it out at regex101.
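Since the question asks about Python, here is a minimal sketch of how the pattern above might be applied line by line with the re module. The helper names base_url and base_urls are hypothetical, and stripping surrounding whitespace from each line is an assumption about how the file is laid out.

```python
import re

# The pattern from the answer: optional protocol, lazy host, optional port,
# then a lookahead for "/" or end of line.
BASE_URL_RE = re.compile(r"^(?:\w+://)?.*?(?::\d+)?(?=/|$)")


def base_url(line):
    """Return the base URL of one line, or None for a blank line."""
    line = line.strip()  # assumption: surrounding whitespace is not significant
    if not line:
        return None
    match = BASE_URL_RE.match(line)
    return match.group(0) if match else None


def base_urls(lines):
    """Return the base URLs of all non-blank lines, in order."""
    return [url for url in map(base_url, lines) if url]


# Sample lines taken from the question.
sample = [
    "wikimapia.org/1649944/Bahawalpur-Railway-Station",
    "panoramio.com/photo/84118355",
    "defence.pk/threads/pictures-of-pakistan-railways.303027",
    "https://hiveminer.com/User/Pakistan Rail Buff",
]

for url in base_urls(sample):
    print(url)

# To read from the file in the question instead (assuming it is named "weburl"):
# with open("weburl") as f:
#     print("\n".join(base_urls(f)))
```

Duplicates are kept in their original order, as in the desired output shown in the question.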

Note that if you don't want to match the port part, you can move it into the positive lookahead, i.e. ^(?:\w+://)?.*?(?=(?::\d+)?(?:/|$))
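The difference between the two patterns only shows up when a URL carries a port. The following sketch compares them in Python; the function names and the localhost URLs are made up for illustration and do not come from the question's file.

```python
import re

# Port is consumed as part of the match.
WITH_PORT_RE = re.compile(r"^(?:\w+://)?.*?(?::\d+)?(?=/|$)")
# Port is only checked inside the lookahead, so it is left out of the match.
WITHOUT_PORT_RE = re.compile(r"^(?:\w+://)?.*?(?=(?::\d+)?(?:/|$))")


def base_with_port(url):
    """Base URL including any port (hypothetical helper)."""
    return WITH_PORT_RE.match(url).group(0)


def base_without_port(url):
    """Base URL with any port dropped (hypothetical helper)."""
    return WITHOUT_PORT_RE.match(url).group(0)


url = "http://localhost:8080/index.html"  # hypothetical example URL
print(base_with_port(url))     # http://localhost:8080
print(base_without_port(url))  # http://localhost
```

For URLs without a port, such as those in the question, both patterns return the same result.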
