I think what I need to do is very simple, but I am having trouble finding any good source that doesn't focus only on scraping a single domain.
I have a list of about 9,000 domains. For each of them, I have to check if a link to my site exists anywhere on their domain. Basically, I need a list of the sites from that list that link back to my site. So, although the input of URLs is 9,000, the result of my code will be much smaller.
Any tips for how to start doing this are greatly appreciated. I've done multiple Scrapy tutorials but this isn't something I've found info about yet.
Edit - here is the spider I'm currently working with:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse


class JakeSpider(CrawlSpider):
    name = 'jake'
    allowed_domains = ['hivedigital.com','gofishdigital.com','quizzly.co']
    start_urls = ['http://hivedigital.com/', 'http://gofishdigital.com/', 'https://quizzly.co/']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        #i = {}
        page = response.url
        domain = urlparse(page).netloc
        print("............", domain)
        links = response.xpath('//a/@href').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        #return i
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'http://twitter.com':
                yield {'link':link,'domain': domain}
You can use the LinkExtractor to get all links and then select just the ones you actually need.
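from scrapy import Spider
from scrapy.linkextractors import LinkExtractor


class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://domain1.com', 'http://domain2.com', ...]

    def parse(self, response):
        # Extract every link on the page, then keep only the ones of interest.
        le = LinkExtractor()
        for link in le.extract_links(response):
            if link.url == 'something I want':
                # do something, e.g. record the page that contains the link
                yield {'page': response.url, 'link': link.url}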
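An exact string comparison such as `link.url == 'something I want'` only matches one specific URL, so a link to a different page of your site, or one with a `www.` prefix, would be missed. A minimal sketch of a hostname-based check follows, using only the standard library. The helper names (`normalize_host`, `links_to_site`, `start_urls_from_domains`) and the `example.com` target are assumptions for illustration, not part of Scrapy.

```python
from urllib.parse import urlparse


def normalize_host(host):
    """Lowercase a hostname and drop a leading 'www.' so variants compare equal."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def links_to_site(url, target_domain):
    """Return True if `url` points at `target_domain` or one of its subdomains."""
    # urlparse(...).hostname is None for relative links and mailto: URLs.
    host = normalize_host(urlparse(url).hostname)
    target = normalize_host(target_domain)
    return host == target or host.endswith("." + target)


def start_urls_from_domains(domains):
    """Build one start URL per bare domain, skipping blank entries."""
    return ["http://" + d.strip() + "/" for d in domains if d.strip()]
```

Inside `parse`, the condition could then read `if links_to_site(link.url, 'example.com'):`, yielding the domain of `response.url` so the output is the list of sites that link back. The list of 9,000 domains could be passed through `start_urls_from_domains` to populate `start_urls`, assuming the input is one bare domain per entry.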