
Scraping multiple websites of different domains using python scrapy

I want to scrape the emails from two different websites along with their corresponding links, but I get two different emails paired with the same link. There are actually various websites to scrape, but for simplicity I have used only two URLs. The code is given below:

import scrapy
import re
import time
urls = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']
class TheknotSpider(scrapy.Spider):
    name = 'theknot'
    def start_requests(self):
        global url
        li = ['http://www.manorhouseohio.com','http://www.OtterCreekAL.com']
        for url in range(len(li)):
            yield scrapy.Request(li[url], callback=self.parse)
    def parse(self, response):
        mail_link = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        html_text = response.xpath("//body").get()
        mail_without_link = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html_text)
        contact = response.xpath("//a[contains(@href,'/contact')]/@href").get()
        if mail_link:
            x = mail_link[7:]
            yield {
            "Main Website": urls[url],
            "Email": x
            }
            time.sleep(1)
        elif mail_without_link:
            yield {
            "Main Website": urls[url],
            "Email": mail_without_link
            }
            time.sleep(1)
        elif contact:
                if contact=="/contact" or contact=="/contact/" or contact=="/contact-us" or contact=="/contact-us/":
                    contact_main = urls[url] + contact
                    yield scrapy.Request(contact_main, callback=self.parse_final)       
                else:
                    yield scrapy.Request(contact, callback=self.parse_final)
        else:
            pass
    def parse_final(self, response):
        mail = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        text = response.xpath("//body").get()
        mail_text = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
        if mail:
            x = mail[7:]
            yield {
            "Main Website": urls[url],
            "Email": x
            }
            time.sleep(1)
        elif mail_text:
            yield {
            "Main Website": urls[url],
            "Email": mail_text
            }
            time.sleep(1)
        else: pass

The problem in your code is that Scrapy does not execute synchronously, which means your global url variable is not what you think it is. start_requests will (most likely) run to completion before any parse callback is called, so by the time parse executes, url holds the last loop index, 1, and urls[url] always points to the second site.

Rather than using global variables, read the URL of the page being parsed from response.url, and if you need the base URL, derive it with urllib.parse (or similar).
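As a minimal sketch of that idea (the helper name base_url is my own, not part of Scrapy), you can derive the "Main Website" value from response.url inside each callback instead of indexing a global list:

```python
from urllib.parse import urlsplit

def base_url(url):
    """Return scheme://netloc for an absolute URL, i.e. the
    'Main Website' value the spider should yield."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

# Inside parse / parse_final you would then write, instead of urls[url]:
#     yield {"Main Website": base_url(response.url), "Email": x}

print(base_url("http://www.manorhouseohio.com/contact/"))
# http://www.manorhouseohio.com
```

Because response.url is an attribute of the response being processed, each callback sees the site it actually came from, no matter in what order the asynchronous requests complete.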
