I've got a question about extracting e-mail addresses from various sites with Scrapy.
I have a spider like this:
from scrapy.spiders import CrawlSpider  # scrapy.contrib.spiders is deprecated
from sufio.items import MItem

class MSpider(CrawlSpider):
    name = 'mparser'
    start_urls = [
        'https://horizonsupply.myshopify.com/pages/about-us',
        'https://fnatic-shop.myshopify.com/pages/about-us',
    ]

    def parse(self, response):
        item = MItem()
        item['facebook'] = response.xpath('//a[contains(@href, "facebook")]/@href').extract_first()
        item['twitter'] = response.xpath('//a[contains(@href, "twitter")]/@href').extract_first()
        # item['email'] =
        yield item
I need to follow each link and check whether there is an e-mail address. Is it possible to do this with Scrapy?
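For context: following links is what Scrapy's `response.follow()` (or yielding `scrapy.Request`) is for, and the e-mail check itself is plain string/HTML work. Below is a minimal stdlib-only sketch (no Scrapy; `MailtoExtractor` and `extract_mailto` are hypothetical names I made up) of pulling addresses out of `mailto:` links on a fetched page:

```python
from html.parser import HTMLParser

class MailtoExtractor(HTMLParser):
    """Collect e-mail addresses from mailto: links in an HTML page."""
    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if href.lower().startswith('mailto:'):
            # strip the "mailto:" scheme and any "?subject=..." query part
            self.emails.append(href[7:].split('?')[0])

def extract_mailto(html):
    parser = MailtoExtractor()
    parser.feed(html)
    return parser.emails

page = '<a href="mailto:info@example.com?subject=Hi">Contact</a>'
print(extract_mailto(page))  # ['info@example.com']
```

In a spider you would run the same extraction inside the callback you pass to `response.follow()` for each discovered link.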
I use something like this:
mails = response.xpath('//a[contains(@href, "mailto:")]/@href').extract()
mails += response.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)][contains(.,"@")] | '
                        '//a[contains(./@href,"@")]/@href').extract()
for a in response.xpath('//a[contains(text(),"@")]'):
    ma = ''.join(a.xpath('./text()').extract())
    mails.append(ma)
After this, I use an additional function to remove duplicate and invalid rows.
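Such a cleanup function might look like the following sketch (`clean_mails` and the regex are my assumptions, not the original code; the pattern is deliberately simple and nowhere near a full RFC 5322 validator):

```python
import re

# Simple e-mail shape check: local part, "@", domain with at least one dot.
EMAIL_RE = re.compile(r'^[\w.+-]+@[\w-]+\.[\w.-]+$')

def clean_mails(mails):
    """Normalize entries, drop invalid ones, de-duplicate preserving order."""
    seen = set()
    out = []
    for m in mails:
        m = m.strip()
        if m.startswith('mailto:'):
            m = m[7:].split('?')[0]  # strip scheme and query part
        if EMAIL_RE.match(m) and m not in seen:
            seen.add(m)
            out.append(m)
    return out

print(clean_mails(['mailto:a@b.com', ' a@b.com ', 'not-an-email', 'c@d.org']))
# ['a@b.com', 'c@d.org']
```

Keeping this as a standalone helper lets you call it once per page on the combined `mails` list instead of filtering inside each XPath loop.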