Python Scrapy: Return list of URLs scraped
I am using Scrapy to scrape all the links from a single domain. I follow every link that stays on the domain, but I want to save every link that points off the domain. The following scraper works correctly, but I can't get at the spider's member variables because I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    # Class-level sets, shared by every instance of the spider.
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url:
                # Follow each on-domain link only once.
                if url not in self.on_domain_urls:
                    print('On domain links found: {}'.format(
                        len(self.on_domain_urls)))
                    self.on_domain_urls.add(url)
                    yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)


process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls
How can I access off_domain_urls? I could probably move it to global scope, but that seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check the item pipeline? I think you'll have to use that in this scenario and decide there what needs to be done with the variable.
See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
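As a rough sketch of what that could look like (not the only way to do it): the pipeline below collects URLs in a class-level set, so they can be read back once the crawl has finished. It assumes the spider yields a dict such as {"off_domain_url": url} for each off-domain link instead of only adding it to a set. The key name off_domain_url, the class name CollectOffDomainUrlsPipeline and the helper run_crawl are all made up for this example. It also assumes a Scrapy version that accepts class objects in ITEM_PIPELINES (2.4 or later, as far as I know); on older versions, use the dotted import path of the class as the key instead.

```python
class CollectOffDomainUrlsPipeline:
    """Item pipeline that gathers every off-domain URL the spider yields."""

    # Class-level set (an assumption of this sketch) so the caller can read
    # the results after the crawl without a reference to the pipeline instance.
    collected = set()

    def process_item(self, item, spider):
        # "off_domain_url" is a hypothetical key the spider is assumed to yield.
        url = item.get("off_domain_url")
        if url:
            self.collected.add(url)
        return item  # pass the item on to any later pipelines


def run_crawl(spider_cls):
    """Run spider_cls with the pipeline enabled and return the collected URLs."""
    # Imported lazily so the pipeline class itself has no Scrapy dependency.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        # 300 is an arbitrary priority; lower numbers run earlier.
        "ITEM_PIPELINES": {CollectOffDomainUrlsPipeline: 300},
    })
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is finished
    return set(CollectOffDomainUrlsPipeline.collected)
```

With that in place, something like off_domain_urls = run_crawl(MySpider) should give you the set in your own script without globals or file I/O. Keeping the set on the class is a shortcut: it is shared across every crawl in the same process, so clear it first if you run more than one.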