
JSON file is not created with Python Scrapy spider

What I want to do

I want to create a JSON file with a Scrapy spider in Python. I am currently working through "Data Visualization with Python and JavaScript". When I run the scraper, the JSON file is not created, and I do not know why.

Directory structure

/root
nobel_winners   scrapy.cfg

/nobel_winners:
__init__.py     items.py    pipelines.py    spiders
__pycache__     middlewares.py    settings.py

/nobel_winners/spiders:
__init__.py     __pycache__     nwinners_list_spider.py

Working process/Code

Enter the following code in nwinners_list_spider.py in /nobel_winners/spiders.

#encoding:utf-8

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()

Run the following command in the root directory.

scrapy crawl nwinners_list -o nobel_winners.json

Error

The following output appears, and no data is written to the JSON file.

2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

What I tried

1. The source code in the book is somewhat longer, but I reduced it to test only the 'country' variable.

2. I opened the IPython-based Scrapy shell and checked each step. I confirmed that 'country' does contain a value.

h2s = response.xpath('//h2')

for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)

Try using this code:

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

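        for h2 in h2s:
            # Yield an item so Scrapy can collect and export it;
            # assigning to a local variable alone produces no output.
            yield NWinnerItem(
                country=h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )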

Then run:

scrapy crawl nwinners_list -o nobel_winners.json -t json


In a callback function such as parse, you parse the response (web page) and return dicts with extracted data, Item objects, Request objects, or an iterable of these objects. See the Scrapy documentation.

That is why 0 items were scraped: the original parse method extracts each country but never returns or yields it.
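The difference can be seen without Scrapy at all. The following is a minimal sketch in plain Python; the function names and the sample headings are hypothetical and only stand in for the spider's parse method and the page's h2 elements.

```python
def parse_without_yield(headings):
    """Mirrors the question's parse(): extracts values but never hands them back."""
    for heading in headings:
        country = heading.strip()  # bound to a local name, then discarded


def parse_with_yield(headings):
    """Mirrors the answer's parse(): yields one item per heading."""
    for heading in headings:
        yield {"country": heading.strip()}


# Hypothetical sample data standing in for the scraped headings.
headings = [" Argentina ", " Australia "]

print(parse_without_yield(headings))     # the caller receives None
print(list(parse_with_yield(headings)))  # the caller receives one dict per heading
```

Scrapy is in the position of the caller here: it can only export what the callback gives back to it.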

Also note that .extract() returns a list of all matches for your XPath query, while .extract_first() returns only the first match (or None if nothing matched).
