
Using Scrapy to parse table page and extract data from underlying links

I am trying to scrape the underlying data from the table on the following page: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries

What I want to do is access the underlying link for each row, and capture:

  1. The ID tag (e.g. QDE001)
  2. The name
  3. The reason for listing / additional information
  4. Other linked entities

This is what I have, but it does not seem to be working. I keep getting `NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))`. I believe the XPaths I have defined are OK; I am not sure what I am missing.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()



class UNSC(scrapy.Spider):
    name = "UNSC"
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',      
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)),callback='data_extract')


    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason =  response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract() 
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item

Try the below approach. Your `rules` never fire because `rules` is only honored by `CrawlSpider`; a plain `scrapy.Spider` sends every response to a `parse` method, which your spider does not define — hence the `NotImplementedError`. The spider below defines `parse` directly and should fetch all the ids and corresponding names from all six pages. The rest of the fields you can manage yourself.

Just run it as it is:

import scrapy

class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def parse(self, response):
        # iterate over the rows of the summaries table
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
