Python and JSON UTF-8 encoding

I am currently facing some encoding issues. As I am French, I frequently use characters like é or è.

I am trying to figure out why they are not displayed correctly in a JSON file that I generate automatically with Scrapy.

Here is my Python code:

# -*- coding: utf-8 -*-

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()

            yield {'numero': int(num), 'nom': nom}

Then, after running the scrapy command, the code produces a JSON file. Here are its first lines:

[
{"numero": 1, "nom": "Bulbizarre"},
{"numero": 2, "nom": "Herbizarre"},
{"numero": 3, "nom": "Florizarre"},
{"numero": 4, "nom": "Salam\u00e8che"},
...
]

(Yes, these are French Pokémon names.)

So, I would like to get rid of the \u00e8 escape sequence; it should be displayed as è. Is there a way to do this?

Thank you in advance, and I hope my English is not too poor :)

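Use the FEED_EXPORT_ENCODING setting, set here through the spider's custom_settings. When this setting is left unset, Scrapy's JSON feed output escapes non-ASCII characters as \uXXXX sequences; setting it to 'utf-8' writes the characters as-is.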

import scrapy
from scrapy.crawler import CrawlerProcess  # needed for the CrawlerProcess call below

class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()

            yield {'numero': int(num), 'nom': nom}

process = CrawlerProcess(settings={
    "FEEDS": {
        "items_json": {"format": "json"},
    },
})

process.crawl(BlogSpider)
process.start()
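
For context, here is a minimal sketch of the same escaping behaviour using only Python's standard json module, not Scrapy itself. The item dictionary is a made-up sample modelled on the output in the question, and the link to Scrapy is an assumption on my part: as far as I know, its JSON exporter relies on the same ensure_ascii mechanism, which FEED_EXPORT_ENCODING switches off.

```python
import json

# Hypothetical sample item, modelled on the JSON output shown in the question.
item = {"numero": 4, "nom": "Salamèche"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same data.
assert json.loads(escaped) == json.loads(readable) == item
```

Note that the escaped file is not corrupted: any JSON parser will turn \u00e8 back into è when loading it. The setting only changes how the file looks when opened in a text editor.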

