Parsing URLs from a txt file

Question

I am trying to parse a txt file that looks like this:

Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

I need to read the file and extract the URL that follows 'Disallow' on each line, ignoring the comments. Thanks in advance.

Answer 1

If you are trying to parse a robots.txt file, you should use the robotparser module (urllib.robotparser in Python 3):

>>> from urllib import robotparser  # Python 3; in Python 2 use "import robotparser"

>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.your_url.com/robots.txt")
>>> r.read()

Then just check whether a given path may be fetched:

>>> r.can_fetch("*", "/foo.html")
False
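As a minimal sketch of the same check without a network request, the rules can be passed to the parser as lines of text. The `User-agent: *` line is an assumption added here (the parser ignores Disallow lines that appear before any User-agent line), and `/index.html` is a hypothetical path used only for contrast.

```python
from urllib import robotparser

# Sample rules from the question. The "User-agent: *" line is an assumption
# added for this sketch so that the Disallow rules belong to a record.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""

parser = robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so no URL has to be fetched.
parser.parse(ROBOTS_TXT.splitlines())

foo_allowed = parser.can_fetch("*", "/foo.html")      # path listed under Disallow
index_allowed = parser.can_fetch("*", "/index.html")  # hypothetical unlisted path
print(foo_allowed, index_allowed)
```

This answers "may I fetch this path?" rather than returning the list of disallowed paths, so it fits when the goal is to obey the rules rather than to extract them.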

Answer 2

Assuming that there is no # in the URLs:

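with open('path/to/file') as infile:
    # Keep only Disallow lines; take the text after the first colon, cut the comment, trim whitespace
    URLs = [line.split(":", 1)[1].split("#", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]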

Allowing for # inside the URLs, but assuming that the # starting a comment is always preceded by a space:

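with open('path/to/file') as infile:
    # Same as above, but only " #" (space, then #) is treated as the start of a comment
    URLs = [line.split(":", 1)[1].split(" #", 1)[0].strip()
            for line in infile if line.startswith("Disallow:")]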
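Both variants can be folded into one small helper, sketched below. The function name `extract_disallowed` and its `comment_marker` parameter are hypothetical, and the `/page#top` entry is an invented sample line. Note that `str.lstrip` takes a set of characters rather than a prefix, so this sketch slices off the `Disallow:` prefix by length instead.

```python
def extract_disallowed(lines, comment_marker="#"):
    """Return the paths that follow 'Disallow:', with trailing comments removed.

    comment_marker is "#" when URLs never contain '#', or " #" when a
    comment is always preceded by a space (hypothetical parameter).
    """
    prefix = "Disallow:"
    paths = []
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip blank lines and other directives
        # Drop the prefix, then everything from the comment marker onward.
        value = line[len(prefix):].split(comment_marker, 1)[0].strip()
        if value:  # an empty Disallow carries no path, so it is skipped
            paths.append(value)
    return paths


sample = [
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space\n",
    "Disallow: /tmp/ # these will soon disappear\n",
    "Disallow: /foo.html\n",
]
paths = extract_disallowed(sample)

# Invented line with '#' inside the URL, handled by the " #" marker.
with_fragment = extract_disallowed(["Disallow: /page#top # note\n"], " #")
print(paths, with_fragment)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the `sample` list.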

Answer 1
5 2013-08-27 23:06:53

Answer 2
1 2013-08-27 23:06:44
