How to get all links from a list of domains with Scrapy?

I think what I need to do is very simple, but I am having trouble finding any good source that doesn't focus only on scraping a single domain.

I have a list of about 9,000 domains. For each of them, I need to check whether a link to my site exists anywhere on that domain. In other words, I need a list of the sites from that list that link back to my site. So although the input is 9,000 URLs, the result of my code will be much smaller.

Any tips on how to get started are greatly appreciated. I've worked through several Scrapy tutorials, but haven't found anything that covers this.

Edit - here is the spider I'm currently working with:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from urllib.parse import urlparse


class JakeSpider(CrawlSpider):
    name = 'jake'
    allowed_domains = ['hivedigital.com','gofishdigital.com','quizzly.co']
    start_urls = ['http://hivedigital.com/', 'http://gofishdigital.com/', 'https://quizzly.co/']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        page = response.url
        # Domain of the page currently being crawled.
        domain = urlparse(page).netloc
        print("............", domain)
        # Extract every link on the page and keep only the ones of interest.
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'http://twitter.com':
                # Record the matching link and the domain it was found on.
                yield {'link': link.url, 'domain': domain}

You can use LinkExtractor to get all the links on a page and then filter down to just the ones you actually need:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://domain1.com', 'http://domain2.com', ...]

    def parse(self, response):
        # LinkExtractor pulls every link out of the response.
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'something I want':
                # do something with the matching link, e.g. yield it as an item
                yield {'link': link.url}
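To scale this up to the original problem, here is a rough, untested sketch: it reads the domain list from a text file, crawls each domain with a CrawlSpider, and yields only the pages that link back to your own site. The filename domains.txt and the target mysite.com are placeholders you would replace with your actual domain list and site.

# -*- coding: utf-8 -*-
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Placeholder file: one bare domain per line, e.g. "hivedigital.com"
with open('domains.txt') as f:
    DOMAINS = [line.strip() for line in f if line.strip()]

# Placeholder: the site you want to find backlinks to
TARGET = 'mysite.com'


class BacklinkSpider(CrawlSpider):
    name = 'backlinks'
    allowed_domains = DOMAINS
    start_urls = ['http://%s/' % d for d in DOMAINS]

    # Follow every internal link and run parse_item on each crawled page.
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        referring_domain = urlparse(response.url).netloc
        for link in LinkExtractor().extract_links(response):
            # Keep only links whose host is the target site (or a subdomain of it).
            if urlparse(link.url).netloc.endswith(TARGET):
                yield {
                    'referring_domain': referring_domain,
                    'page': response.url,
                    'link': link.url,
                }

Running it with scrapy crawl backlinks -o backlinks.csv gives a file you can de-duplicate by referring_domain to get the final list of linking sites; with roughly 9,000 domains you will probably also want to cap the crawl with settings such as DEPTH_LIMIT.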
