JSON file is not created by a Python Scrapy spider

Question

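What I want to do

I want to create a JSON file with a Scrapy spider in Python. I am currently working through "Data Visualization with Python and JavaScript". When I run the crawl, the JSON file is not populated, and I don't know why.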

Directory structure

/root
nobel_winners   scrapy.cfg

/nobel_winners:
__init__.py     items.py    pipelines.py    spiders
__pycache__     middlewares.py    settings.py

/nobel_winners/spiders:
__init__.py     __pycache__     nwinners_list_spider.py

Steps / code

I put the following code in nwinners_list_spider.py under /nobel_winners/spiders.

#encoding:utf-8

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()

I ran the following command in the root directory.

scrapy crawl nwinners_list -o nobel_winners.json

Error

The following output appears, and no data is written to the JSON file.

2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

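What I tried

1. The source in the book is longer, but I cut it down so that only the `country` variable is checked.

2. I opened the Scrapy shell (the IPython-based shell) and checked the behavior step by step. I confirmed that the value is correctly stored in `country`.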

h2s = response.xpath('//h2')

for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)

Answer 1

Try the following code:

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            yield NWinnerItem(
                country = h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )

Then run scrapy crawl nwinners_list -o nobel_winners.json -t json

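In a callback function, you parse the response (web page) and return dicts with the extracted data, Item objects, Request objects, or an iterable of these objects. See the Scrapy documentation.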

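That is why 0 items were scraped: your parse method computes `country` but never hands it back. You need to return (or yield) the items!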
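
Why a callback that never yields produces nothing can be seen in a plain-Python sketch. This is only a simplified model and involves no Scrapy code: `collect_items`, `parse_without_yield`, and `parse_with_yield` are hypothetical names, and the list of strings stands in for the XPath results.

```python
def parse_without_yield(headings):
    # Mirrors the question's spider: the value is computed, then discarded.
    for heading in headings:
        country = heading.strip()


def parse_with_yield(headings):
    # Mirrors the answer's spider: each value is handed back to the caller.
    for heading in headings:
        yield {"country": heading.strip()}


def collect_items(callback, headings):
    # Hypothetical stand-in for the framework:
    # it can only collect what the callback returns.
    result = callback(headings)
    return list(result) if result is not None else []


sample = [" Argentina ", " Australia "]  # illustrative values only

print(collect_items(parse_without_yield, sample))
print(collect_items(parse_with_yield, sample))
```

The first call prints an empty list because the function returns `None`; the second returns one dict per heading, which is what the feed exporter needs in order to write anything.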

Also note that .extract() returns a list of everything your XPath query matched, while .extract_first() returns only the first element of that list.
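
The choice affects the shape of the exported JSON. Here is a small sketch using only the standard `json` module, with an illustrative value standing in for the XPath result (no Scrapy call is made, and `matches` is a made-up sample):

```python
import json

# What a list-returning call such as .extract() would hand back (illustrative).
matches = ["Argentina"]

# Storing the whole list keeps the brackets in the output.
item_with_list = {"country": matches}

# Like .extract_first(): the first match, or None when nothing matched.
item_with_first = {"country": matches[0] if matches else None}

print(json.dumps(item_with_list))
print(json.dumps(item_with_first))
```

With the list form each record carries a one-element array; with the first-element form it carries a plain string, which is usually easier to consume downstream.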
