What I want to do
I want to create a JSON file using a Scrapy spider in Python. I am currently working through the book "Data Visualization with Python and JavaScript". When I run the scraper, the JSON file ends up empty and I don't know why.
Directory structure
/root
nobel_winners scrapy.cfg
/nobel_winners:
__init__.py items.py pipelines.py spiders
__pycache__ middlewares.py settings.py
/nobel_winners/spiders:
__init__.py __pycache__ nwinners_list_spider.py
Working process/Code
Put the following code in nwinners_list_spider.py under /nobel_winners/spiders.
#encoding:utf-8
import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):
        h2s = response.xpath('//h2')
        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()
Run the following command in the project root directory.
scrapy crawl nwinners_list -o nobel_winners.json
Error
The following output appears and no data is written to the JSON file.
2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
What I tried
1. The source in the book is somewhat longer, but I cut it down so that it only extracts the 'country' variable.
2. I opened the Scrapy shell (which is IPython-based) and ran each statement one at a time. This confirmed that 'country' does receive a value.
h2s = response.xpath('//h2')
for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)
Try using this code:
import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):
        h2s = response.xpath('//h2')
        for h2 in h2s:
            # Yield one item per heading so that Scrapy can collect and export it.
            yield NWinnerItem(
                country = h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )
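Then run scrapy crawl nwinners_list -o nobel_winners.json -t json (the -t json option is optional here, since Scrapy can infer the format from the .json file extension).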
In the callback function, you parse the response (the web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects (see the Scrapy documentation).
This is why 0 items were scraped: the original parse() assigns the value to country but never returns or yields it, so Scrapy has nothing to export.
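To illustrate the point without running a crawl, here is a minimal pure-Python sketch. The names parse_without_yield, parse_with_yield and collect_items are hypothetical, and collect_items is only a simplified stand-in for the way Scrapy iterates over whatever a callback returns, not Scrapy's actual implementation.

```python
def parse_without_yield(headings):
    # Mirrors the original spider: the value is computed, then discarded.
    for heading in headings:
        country = heading.strip()


def parse_with_yield(headings):
    # Mirrors the fixed spider: each value is handed back to the caller.
    for heading in headings:
        yield {"country": heading.strip()}


def collect_items(callback, headings):
    # Hypothetical stand-in for the engine: iterate over the callback's result.
    result = callback(headings)
    return [] if result is None else list(result)


headings = [" Argentina ", " Australia "]
print(collect_items(parse_without_yield, headings))  # []
print(collect_items(parse_with_yield, headings))     # [{'country': 'Argentina'}, {'country': 'Australia'}]
```

The first function returns None, so nothing is collected, which matches the "scraped 0 items" line in the log.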
Also note that .extract() returns a list of all matches for your XPath query, while .extract_first() returns only the first element of that list (or None if nothing matched).