
json file is not created with Python Scrapy Spider

What I want to do

I want to create a json file using Python's Scrapy spider. I am currently working through "Data Visualization with Python and JavaScript". When I run the scrape, I don't understand why the json file is not created.

Directory structure

/root
nobel_winners   scrapy.cfg

/nobel_winners:
__init__.py     items.py    pipelines.py    spiders
__pycache__     middlewares.py    settings.py

/nobel_winners/spiders:
__init__.py     __pycache__     nwinners_list_spider.py

Working process/Code

Enter the following code in nwinners_list_spider.py under /nobel_winners/spiders.

#encoding:utf-8

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()

Enter the following command in the root directory.

scrapy crawl nwinners_list -o nobel_winners.json

Error

The following output appears, and no data is written to the json file.

2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

What I tried

1. The source code in the book is a bit longer, but I checked it using only the 'country' variable.

2. I entered the scrapy shell and checked each step using the IPython-based shell, and confirmed that the value was indeed stored in 'country':

h2s = response.xpath('//h2')

for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)
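
For reference, this check can be reproduced by launching the interactive shell against the same page, for example:

scrapy shell "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"

and then pasting the two lines above at the IPython prompt.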

Try using this code:

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            yield NWinnerItem(
                country = h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )

And then run scrapy crawl nwinners_list -o nobel_winners.json -t json


In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. See the Scrapy documentation.

This is the reason why 0 items were scraped: you need to return them!
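
For instance, here is a minimal sketch of the same spider yielding plain dicts, which Scrapy's feed exporter accepts just like Item objects (the spider name used here is hypothetical):

import scrapy

class NWinnerDictSpider(scrapy.Spider):
    # hypothetical name so it does not clash with the existing 'nwinners_list' spider
    name = 'nwinners_list_dicts'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):
        for h2 in response.xpath('//h2'):
            country = h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            if country:
                # yielding a plain dict hands the scraped data back to the engine,
                # so the feed exporter can write it to nobel_winners.json
                yield {'country': country}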

Also note that .extract() returns a list based on your xpath query, while .extract_first() returns the first element of that list.
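
As a quick illustration (a sketch that can be run in the scrapy shell against the same page; the variable names are just examples):

# all matching text nodes, as a list of strings (an empty list if nothing matches)
headlines = response.xpath('//h2/span[@class="mw-headline"]/text()').extract()

# only the first match, or None if there is no match
first_headline = response.xpath('//h2/span[@class="mw-headline"]/text()').extract_first()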
