How to get just base URLs from a text file in Python?
I have a text file named weburl that contains many URLs, and I want to extract only the base URLs using regex. Contents of weburl:
wikimapia.org/1649944/Bahawalpur-Railway-Station
panoramio.com/photo/84118355
wikimapia.org/1649944/Bahawalpur-Railway-Station
nativepakistan.com/photos-of-bahawalpur
defence.pk/threads/pictures-of-pakistan-railways.303027
nativepakistan.com/photos-of-bahawalpur
panoramio.com/photo/51311162
https://hiveminer.com/User/Pakistan Rail Buff
I need this:
wikimapia.org
panoramio.com
wikimapia.org
nativepakistan.com
defence.pk
nativepakistan.com
panoramio.com
https://hiveminer.com
How can I do this using regex?
One solution could be:
^(?:\w+://)?.*?(?::\d+)?(?=/|$)
It matches the beginning of the line ( ^ ) followed by an optional protocol specification, e.g. https:// ( (?:\w+://)? ). Then it lazily matches any number of any characters ( .*? ) up to an optional port specification, like :80 ( (?::\d+)? ). Finally, it checks that the match is followed by a / or the end of the line (the positive lookahead (?=/|$) ).
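To make the pieces easier to see, here is the same pattern spelled out with Python's re.VERBOSE flag, one component per line. This is only an illustrative sketch: the name BASE_URL_VERBOSE is made up for this example, and the test line is taken from the sample data in the question.

```python
import re

# Same pattern as above, split into its parts with re.VERBOSE.
# BASE_URL_VERBOSE is a name chosen for this sketch only.
BASE_URL_VERBOSE = re.compile(
    r"""
    ^               # beginning of the line
    (?:\w+://)?     # optional protocol, e.g. https://
    .*?             # as few characters as possible (lazy)
    (?::\d+)?       # optional port, e.g. :80
    (?=/|$)         # must be followed by "/" or the end of the line
    """,
    re.VERBOSE,
)

line = "defence.pk/threads/pictures-of-pakistan-railways.303027"
base = BASE_URL_VERBOSE.match(line).group(0)
print(base)  # defence.pk
```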
Check it out here at regex101.
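As a rough sketch of how the pattern could be applied in Python, the snippet below runs it against the sample lines from the question. The function name base_urls and the inline sample list are assumptions made for illustration; they stand in for reading the actual weburl file.

```python
import re

# The pattern from the answer, compiled once for reuse.
BASE_URL = re.compile(r"^(?:\w+://)?.*?(?::\d+)?(?=/|$)")


def base_urls(lines):
    """Return the base URL of every non-blank line (hypothetical helper)."""
    result = []
    for line in lines:
        line = line.strip()  # drop the trailing newline and surrounding spaces
        if not line:
            continue  # skip blank lines, which would otherwise match as ""
        match = BASE_URL.match(line)
        if match:
            result.append(match.group(0))
    return result


# Sample lines copied from the question, standing in for the weburl file.
sample = [
    "wikimapia.org/1649944/Bahawalpur-Railway-Station",
    "panoramio.com/photo/84118355",
    "nativepakistan.com/photos-of-bahawalpur",
    "defence.pk/threads/pictures-of-pakistan-railways.303027",
    "https://hiveminer.com/User/Pakistan Rail Buff",
]

for url in base_urls(sample):
    print(url)
```

Because the helper only iterates over lines, an open file object can be passed in directly, for example base_urls(open("weburl")), assuming the file sits in the working directory.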
Note that if you don't want to match the port part, you could move it into the positive lookahead, i.e. ^(?:\w+://)?.*?(?=(?::\d+)?(?:/|$))
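The difference between the two patterns only shows up on a URL that carries a port. The sketch below compares them on a hypothetical URL (example.com with port 8080 is not part of the original data), and the names WITH_PORT and WITHOUT_PORT are chosen just for this example.

```python
import re

# Original pattern: the port is part of the match.
WITH_PORT = re.compile(r"^(?:\w+://)?.*?(?::\d+)?(?=/|$)")
# Variant: the port is only checked inside the lookahead, so it is not captured.
WITHOUT_PORT = re.compile(r"^(?:\w+://)?.*?(?=(?::\d+)?(?:/|$))")

# Hypothetical URL used only to illustrate the port handling.
url = "http://example.com:8080/some/path"

kept = WITH_PORT.match(url).group(0)
dropped = WITHOUT_PORT.match(url).group(0)

print(kept)     # http://example.com:8080
print(dropped)  # http://example.com
```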