Scrapy: read list of URLs from file to scrape?
I've just installed Scrapy and followed their simple dmoz tutorial, which works. I just looked up basic file handling for Python and tried to get the crawler to read a list of URLs from a file, but got some errors. This is probably wrong, but I gave it a shot. Would someone please show me an example of reading a list of URLs into Scrapy? Thanks in advance.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("urls.txt")
    start_urls = f

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
You were pretty close.
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
...better still would be to use a context manager to ensure the file is closed as expected:
with open("urls.txt", "rt") as f:
    start_urls = [url.strip() for url in f.readlines()]
If the spider expects just clean URLs in the list, you have to call strip() on each line; otherwise you get a '\n' at the end of each URL.
class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [l.strip() for l in open('urls.txt').readlines()]
Example in Python 2.7
>>> open('urls.txt').readlines()
['http://site.org\n', 'http://example.org\n', 'http://example.com/page\n']
>>> [l.strip() for l in open('urls.txt').readlines()]
['http://site.org', 'http://example.org', 'http://example.com/page']
I ran into a similar question when writing my Scrapy hello-world. Besides reading URLs from a file, you might also need to pass the file name in as an argument. This can be done with the spider argument mechanism.
My example:
import json

import scrapy

class MySpider(scrapy.Spider):
    name = 'my'

    def __init__(self, config_file=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # config_file is a JSON file whose 'url_list' key holds the URLs to crawl.
        with open(config_file) as f:
            self._config = json.load(f)
        self._url_list = self._config['url_list']

    def start_requests(self):
        for url in self._url_list:
            yield scrapy.Request(url=url, callback=self.parse)
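The config_file argument above is supplied on the command line with Scrapy's -a flag, e.g. scrapy crawl my -a config_file=config.json. The JSON loading step the spider performs can be sketched in isolation; the file name config.json and the url_list key are the assumptions made by the spider above, and the sample URLs are illustrative:

```python
import json
import os
import tempfile

# Write a sample config with the structure the spider expects:
# a JSON object whose "url_list" key holds a list of URLs.
config = {"url_list": ["http://example.org", "http://example.com/page"]}
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config, f)

# Load it back the same way MySpider.__init__ does.
with open(path) as f:
    url_list = json.load(f)["url_list"]

print(url_list)  # ['http://example.org', 'http://example.com/page']
```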