Using Scrapy to parse table page and extract data from underlying links
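I am trying to scrape the underlying data from the table on the following page: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is visit the underlying link for each row and capture:
Here is what I have, but it doesn't seem to work. I keep getting NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe my XPath definitions are fine, so I'm not sure what I'm missing.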
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()

class UNSC(scrapy.Spider):
    name = "UNSC"
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)),callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item
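Try the following approach. It should fetch all the IDs and the corresponding names from all of the listing pages (page=0 through page=6). I suppose you can manage the remaining fields yourself.
Run it as is: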
import scrapy

class UNSC(scrapy.Spider):
    name = "UNSC"
    # Build the listing URLs for page=0 through page=6.
    start_urls = ['https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page) for page in range(0,7)]

    def parse(self, response):
        # Each <tr> in the table body is one entry on the sanctions list.
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            # Take the last text node of each cell and trim surrounding whitespace.
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}