
After exporting scraped data to csv via Scrapy (Python), I'm getting characters like †in the file

I wrote a spider in Scrapy to extract data from quotes.toscrape.com, but when I exported the extracted data to CSV, the “ (quote symbol) was converted to characters like â€

Here is the spider code, written in Sublime Text 3 on a Windows machine.

# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = 'Test'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            text = quote.xpath('.//*[@class="text"]/text()').extract_first()
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            tags = quote.xpath('.//*[@itemprop="keywords"]/@content').extract_first()
            yield{"Text": text, "Author": author, "Tags": tags}
        next_p = response.xpath('//*[@class="next"]/a/@href').extract_first()
        absolute_n = response.urljoin(next_p)
        yield scrapy.Request(absolute_n)

Here is the command I used to export the data, which the spider yields as dictionaries, to a CSV file. (I ran it from the Windows command prompt.)

scrapy crawl Test -o scraped.csv

And this is how the data appeared in the CSV file.

Please help me resolve this, and treat me like a beginner.

That sequence of mojibake looks like what you get if you encode smart quotes (like “, U+201C) as UTF-8 and then decode them as Windows-1252, the Latin-1 lookalike that puts the Euro symbol at byte 0x80. For example:

>>> print('\u201c'.encode('utf-8').decode('cp1252'))
â€œ
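To make the round trip concrete, here is a minimal sketch, assuming the wrong codec is Windows-1252 (Python's `cp1252`). It also shows that the damage can be reversed as long as every byte mapped to a character.

```python
# A left smart quote, U+201C, like the ones on quotes.toscrape.com.
original = '\u201c'

# Correct UTF-8 encoding: three bytes.
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)            # b'\xe2\x80\x9c'

# Decoding those bytes with the wrong codec yields three characters.
garbled = utf8_bytes.decode('cp1252')
print(ascii(garbled))        # '\xe2\u20ac\u0153'

# Undoing the wrong step recovers the original text.
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired == original)  # True
```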

There are two likely places things could be going wrong. Since you haven't shown us the raw bytes at any step in the process, it's impossible to know which of the two it is, but I can explain how to deal with both of them.


First, you could be decoding the HTML response that contains these quotes as Windows-1252 or a similar single-byte encoding, even though it's encoded in UTF-8.

If you're doing this explicitly, just stop doing that.

But more likely, you're getting, e.g., a TextResponse from Scrapy and just accessing resp.text, and the page had an incorrect header or meta tag or the like, causing Scrapy to mis-decode it.

To fix this, you want to access the raw bytes and decode them explicitly. So, if you were using resp.text, you'd do resp.body.decode('utf8') instead.
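As a rough sketch of explicit decoding, the helper below works on plain bytes so it can be tried without Scrapy. The name `decode_body` and its `declared_encoding` parameter are hypothetical, not part of Scrapy; in a spider you would pass `response.body` to it.

```python
def decode_body(body: bytes, declared_encoding: str = 'cp1252') -> str:
    """Decode raw response bytes, preferring strict UTF-8.

    `declared_encoding` stands in for whatever the page's header or
    meta tag claims; it is only used if the bytes are not valid UTF-8.
    """
    try:
        return body.decode('utf-8')
    except UnicodeDecodeError:
        return body.decode(declared_encoding, errors='replace')


# The bytes a server would send for the text “Hi” in UTF-8.
raw = b'\xe2\x80\x9cHi\xe2\x80\x9d'
print(ascii(decode_body(raw)))   # '\u201cHi\u201d'
```

Inside `parse`, the decoded string could then be wrapped in `scrapy.Selector(text=...)` so the existing XPath expressions keep working.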


Alternatively, you could be decoding the HTML fine, and encoding the CSV fine, and you're just opening that CSV as Windows-1252 instead of UTF-8. In which case there's nothing to change in your code; you just need to look at the settings of your spreadsheet program.
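One way to tell the two cases apart is to look at the raw bytes of the exported file. The sketch below is only an illustration: the function name `classify_quote_bytes` is hypothetical, and it assumes the wrong codec is Windows-1252 and that the data contains at least one left smart quote.

```python
# A correctly stored left smart quote, and the same quote after being
# decoded as cp1252 and written out as UTF-8 again (double encoding).
GOOD = '\u201c'.encode('utf-8')
DOUBLE = GOOD.decode('cp1252').encode('utf-8')


def classify_quote_bytes(data: bytes) -> str:
    """Report whether a left smart quote is stored correctly."""
    if DOUBLE in data:
        return 'file is garbled: fix the decoding in the spider'
    if GOOD in data:
        return 'file is fine: fix the encoding the viewer uses'
    return 'no left smart quote found'


print(classify_quote_bytes('\u201cA quote\u201d'.encode('utf-8')))
```

For the real export, the input would be the file's bytes, e.g. the result of `open('scraped.csv', 'rb').read()`.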

However, if you're on Windows, a lot of Windows software (especially from Microsoft) makes some weird assumptions. By default, a text file is assumed to be encoded in the system's ANSI code page, which is usually something like Windows-1252. To override this and force UTF-8, you're expected to include a "byte order mark". This isn't really a byte order mark (because byte order makes no sense for 8-bit encodings), and the Unicode standard neither requires nor recommends it for UTF-8, but Microsoft does it anyway.

So, if you're using Excel on Windows, and you don't want to change the settings, you can work around Microsoft's problem by writing the file with the utf-8-sig encoding instead of utf-8, which will force this "BOM" to be written:

import csv

# newline='' lets the csv module control line endings itself.
with open('outfile.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    # etc.
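To check that the BOM really ends up in the file, a small self-contained sketch can write a CSV into a temporary directory and read the first bytes back. The row contents here are made up for illustration.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.csv')

# Write two rows using the BOM-prefixed UTF-8 codec.
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Text', 'Author'])
    writer.writerow(['\u201cHello\u201d', 'Someone'])

# Read the file back as raw bytes to inspect what was written.
with open(path, 'rb') as f:
    data = f.read()

print(data[:3])   # b'\xef\xbb\xbf'
```

Those three leading bytes are the UTF-8 encoding of U+FEFF, which is the signature Excel looks for.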

Since you appear to be creating your export pipeline just by passing -o scraped.csv to the scrapy crawl command, I believe you need to set FEED_EXPORT_ENCODING, either in your project's settings.py or on the crawl command line ( scrapy crawl Test -s FEED_EXPORT_ENCODING=utf-8-sig -o scraped.csv ).
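For the settings-file route, the change would be a single line. This is a sketch that assumes a standard Scrapy project layout where settings.py sits next to the spiders package.

```python
# settings.py (in the Scrapy project that contains the Test spider)

# Write exported feeds, including `-o scraped.csv`, as UTF-8 with a
# leading BOM so that Excel on Windows detects the encoding.
FEED_EXPORT_ENCODING = 'utf-8-sig'
```

After this, the original command ( scrapy crawl Test -o scraped.csv ) should need no extra options.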
