I want to scrape emails from two different websites along with their corresponding links, but I get two different emails with the same link. There are actually various websites to scrape, but for simplicity I have used only two URLs. The code is given below:
import scrapy
import re
import time

urls = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']

class TheknotSpider(scrapy.Spider):
    name = 'theknot'

    def start_requests(self):
        global url
        li = ['http://www.manorhouseohio.com','http://www.OtterCreekAL.com']
        for url in range(len(li)):
            yield scrapy.Request(li[url], callback=self.parse)

    def parse(self, response):
        mail_link = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        html_text = response.xpath("//body").get()
        mail_without_link = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html_text)
        contact = response.xpath("//a[contains(@href,'/contact')]/@href").get()
        if mail_link:
            x = mail_link[7:]
            yield {
                "Main Website": urls[url],
                "Email": x
            }
            time.sleep(1)
        elif mail_without_link:
            yield {
                "Main Website": urls[url],
                "Email": mail_without_link
            }
            time.sleep(1)
        elif contact:
            if contact=="/contact" or contact=="/contact/" or contact=="/contact-us" or contact=="/contact-us/":
                contact_main = urls[url] + contact
                yield scrapy.Request(contact_main, callback=self.parse_final)
            else:
                yield scrapy.Request(contact, callback=self.parse_final)
        else:
            pass

    def parse_final(self, response):
        mail = response.xpath("//a[contains(@href,'mailto:')]/@href").get()
        text = response.xpath("//body").get()
        mail_text = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
        if mail:
            x = mail[7:]
            yield {
                "Main Website": urls[url],
                "Email": x
            }
            time.sleep(1)
        elif mail_text:
            yield {
                "Main Website": urls[url],
                "Email": mail_text
            }
            time.sleep(1)
        else: pass

The problem in your code is that Scrapy does not execute synchronously, so your global url variable is not what you think it is. start_requests will (likely) run to completion before any parse callback is called, which leaves url at its last loop value. As a result, url will be 1 inside all your parse methods.
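
The ordering problem can be reproduced without Scrapy. The following is only a simplified sketch: it assumes the engine consumes every request from start_requests before running any callback (which is likely but not guaranteed), and the two-argument parse function is a hypothetical stand-in for the spider's callback.

```python
# Simplified stand-in for the spider: no Scrapy involved.
sites = ['http://www.manorhouseohio.com', 'http://www.OtterCreekAL.com']


def start_requests(sites):
    # Same pattern as the question: the loop variable is a global.
    global url
    for url in range(len(sites)):
        yield sites[url]


def parse(sites, response_url):
    # Reads the global, which by now holds the last loop index.
    return {"Main Website": sites[url], "Fetched": response_url}


# Assumption: all requests are scheduled before any callback runs.
scheduled = list(start_requests(sites))
items = [parse(sites, fetched) for fetched in scheduled]

for item in items:
    print(item)
```

Under that assumption, both items report the second site as "Main Website" even though each one was fetched from a different address, which matches the symptom described in the question.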
Rather than using global variables, read the URL from the response itself with response.url, and if you need the base URL, use urllib.parse (or similar) to extract it.
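
A minimal sketch of that idea, using only the standard library. The helper names base_url and contact_url are hypothetical, and the example page address is made up for illustration; inside a real callback the page_url argument would be response.url.

```python
from urllib.parse import urljoin, urlsplit


def base_url(page_url):
    """Return scheme://host for a page URL (hypothetical helper)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


def contact_url(page_url, href):
    """Resolve a contact link, relative or absolute, against the page URL."""
    return urljoin(page_url, href)


# In a Scrapy callback, this value would come from response.url.
page = "http://www.manorhouseohio.com/venue/index.html"

print(base_url(page))
print(contact_url(page, "/contact-us/"))
print(contact_url(page, "http://www.manorhouseohio.com/contact"))
```

Because urljoin handles both relative and absolute links, the explicit comparison against "/contact", "/contact/" and so on is no longer needed. Note that response.url is the address that was actually fetched, so after a redirect, or inside parse_final, it may differ from the address you started with.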