
How to get all links from a list of domains with Scrapy?

I think what I need to do is very simple, but I am having trouble finding any good source that doesn't focus only on scraping a single domain.

I have a list of about 9,000 domains. For each of them, I have to check whether a link to my site exists anywhere on their domain. Basically, I need a list of the sites from that list that link back to my site. So, although the input is 9,000 URLs, the result of my code will be much smaller.

Any tips for how to start doing this are greatly appreciated. I've done multiple Scrapy tutorials, but this isn't something I've found info about yet.

Edit - here is the spider I'm currently working with:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from urllib.parse import urlparse


class JakeSpider(CrawlSpider):
    name = 'jake'
    allowed_domains = ['hivedigital.com','gofishdigital.com','quizzly.co']
    start_urls = ['http://hivedigital.com/', 'http://gofishdigital.com/', 'https://quizzly.co/']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # domain of the page currently being crawled
        domain = urlparse(response.url).netloc
        print("............", domain)
        # extract every link on the page and keep only the ones pointing at the URL I care about
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'http://twitter.com':
                yield {'link': link, 'domain': domain}
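One caveat worth flagging (an observation, not part of the original post): an exact comparison like link.url == 'http://twitter.com' only matches that precise URL, so links such as https://twitter.com/someuser, or any variant with a different scheme, path, or trailing slash, are missed. A minimal sketch of matching on the link's host instead, reusing the urlparse import already in the spider (the example.com target is a placeholder assumption):

# Sketch: match extracted links by host rather than by exact URL.
# TARGET_NETLOC is a placeholder; substitute the domain you are looking for.
from urllib.parse import urlparse

TARGET_NETLOC = 'example.com'

def links_to_target(links):
    """Yield only the extracted links whose host is the target domain."""
    for link in links:
        netloc = urlparse(link.url).netloc.lower()
        # accept both example.com and subdomains such as www.example.com
        if netloc == TARGET_NETLOC or netloc.endswith('.' + TARGET_NETLOC):
            yield link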

You can use the LinkExtractor to get all links and then just select the ones you actually need.

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://domain1.com', 'http://domain2.com', ...]

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'something I want':
                # do something with the matching link, for example:
                yield {'url': link.url}
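As a sketch of how that pattern could be adapted to the original question (the domains.txt file name and the mysite.com target are assumptions, not from the post): load the ~9,000 domains from a file, let the crawl follow internal links on each of those sites, and yield the referring domain whenever a page contains a link pointing back at your own site.

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def load_domains(path='domains.txt'):
    # assumption: one bare domain per line, e.g. example.com
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


class BacklinkSpider(CrawlSpider):
    name = 'backlinks'
    allowed_domains = load_domains()  # keeps the crawl on the listed sites
    start_urls = ['http://%s/' % d for d in allowed_domains]

    # follow every internal link and run parse_item on each crawled page
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    target = 'mysite.com'  # assumption: the site you want backlinks for

    def parse_item(self, response):
        referring_domain = urlparse(response.url).netloc
        # extract_links still sees offsite links on the page, even though the
        # crawl itself is restricted to allowed_domains by the offsite middleware
        for link in LinkExtractor().extract_links(response):
            netloc = urlparse(link.url).netloc.lower()
            if netloc == self.target or netloc.endswith('.' + self.target):
                yield {'referring_domain': referring_domain, 'link': link.url}

One referring domain can produce the same backlink many times, so deduplicating the output afterwards (or keeping a seen-domains set in the spider) is probably worthwhile.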
