
Using Scrapy to scrape data

I am trying to scrape data using Scrapy, but I am having trouble editing the code. Here is what I have done as an experiment:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}

        next_page = response.css('li.col-md-3.ln-t > div.cs-team.team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

I took the example from the scrapy.org website and tried to modify it. This code extracts the names of all the agents from the given page.
But I want Scrapy to go into each agent's page and extract the agent's information from there.
For example:

Name: name of the agent
Phone: Phone Number
Email: email address
Website: URL of the website, etc.

I hope this clarifies my problem. I would appreciate a solution.

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://anon.example.com']

    # Step 1: collect the 502 agent detail-page URLs from the listing page
    def parse(self, response):
        info_urls = response.xpath('//div[@class="text"]//a/@href').extract()
        for info_url in info_urls:
            # urljoin turns a relative href into an absolute URL; absolute ones pass through unchanged
            yield scrapy.Request(url=response.urljoin(info_url), callback=self.parse_info)

    # Step 2: visit each detail page and extract the agent's info
    def parse_info(self, response):
        info = {}
        info['name'] = response.xpath('//h2/text()').extract_first()
        info['phone'] = response.xpath('//text()[contains(.,"Phone:")]').extract_first()
        info['email'] = response.xpath('//*[@class="cs-user-info"]/li[1]/text()').extract_first()
        info['website'] = response.xpath('//*[@class="cs-user-info"]/li[2]/a/text()').extract_first()
        # Yield the item so Scrapy can collect and export it (e.g. with -o agents.json)
        yield info

The name can be found on the detail page, so in the first step we just collect all the detail-page URLs.

Then we visit each URL and extract all the info.
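
One thing to watch in this step: if the hrefs collected from the listing page are relative, they have to be resolved against the page URL before they can be requested, because scrapy.Request rejects a URL without a scheme. Here is a minimal standard-library sketch of that resolution. The listing path, the sample hrefs and the helper name absolutize are made up for illustration; inside a spider, response.urljoin(href) serves the same purpose.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical listing URL and hrefs, only to show the three common cases
page_url = 'http://anon.example.com/agents/'
hrefs = [
    '/agent/jane-doe',                   # root-relative
    'john-doe',                          # relative to the current path
    'http://anon.example.com/agent/x',   # already absolute
]

for url in absolutize(page_url, hrefs):
    print(url)
```

Absolute hrefs come back unchanged, so it is safe to run every link through the same call.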

The data may need clean-up, but the idea is clear.
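
As a rough idea of what that clean-up could look like: the extracted strings usually carry stray whitespace, and the phone text still contains its "Phone:" label. The helper below is only a sketch. The name clean_field and the sample input are assumptions of mine, not taken from the site, and the real pages may need different rules.

```python
import re


def clean_field(raw, label=None):
    """Collapse whitespace and drop an optional leading label such as 'Phone:'.

    Returns None when nothing useful is left (or when raw is None,
    which is what extract_first() gives for a missing node).
    """
    if raw is None:
        return None
    text = re.sub(r'\s+', ' ', raw).strip()
    if label and text.lower().startswith(label.lower()):
        text = text[len(label):].strip()
    return text or None


# Hypothetical raw value, shaped like what the phone XPath might return
print(clean_field('\n   Phone:  555 0100 \n', label='Phone:'))
```

In the spider you would wrap each extract_first() result in it, for example clean_field(value, 'Phone:') for the phone field.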

