
Scrapy exporting weird symbols into csv file

Ok, so here's the issue. I'm a beginner who has just started to delve into scrapy/python.

I use the code below to scrape a website and save the results into a CSV. When I look in the command prompt, it turns words like Officiële into Offici\xeble. In the CSV file, it changes it to officiÃ«le. I think this is because it's saving in Unicode instead of UTF-8? I have no clue how to change my code, though, and I've been trying all morning.

Could anyone help me out here? I'm specifically looking at making sure item["publicatietype"] works properly. How can I encode/decode it? What do I need to write? I tried using replace('Ã«', 'ë'), but that gives me an error (Non-ASCII character, but no encoding declared).

import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem

# ThingsToGather is the project's Item class; the module path below is an assumption
from myproject.items import ThingsToGather


class pagespider(Spider):
    name = "OBSpider"
    # max_pages is here to prevent endless loops; set it as high as you need.
    # The spider will request up to that page even if nothing is there; a number
    # that is too high just wastes time and yields no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback = self.parse)


    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']

    # It loads some general info from the header. If this string is shorter than
    # 5 characters, the link is probably faulty (e.g. a 404 error) and the item
    # is dropped. Otherwise it continues.

            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item = self.__normalise_item(item, response.url)

    # If the string is shorter than 5 characters, the required data is not on
    # this page and has to be retrieved from the technical information link.
    # Otherwise the item is complete and is yielded directly.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request 
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)    
            return item

    # The methods below clean up strings. Everything is sent to __normalise_item
    # to remove unwanted whitespace (strip) and double spaces (split/join).

    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])

        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value

ANSWER:

See the comment by paul trmbrth below. The problem is not Scrapy; it's Excel.

For anyone coming across this question as well: the tl;dr is to import the data in Excel (via the Data menu in the ribbon) and to switch the file origin from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).

Officiële will be represented as u'Offici\xeble' in Python 2, as seen in the Python shell session below (no need to worry about the \xXX sequences; that's just how Python represents non-ASCII Unicode characters):

$ python
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>> 

I think this is because it's saving in Unicode instead of UTF-8

UTF-8 is an encoding, Unicode is not.

ë, aka U+00EB, aka LATIN SMALL LETTER E WITH DIAERESIS, will be UTF-8 encoded as 2 bytes, \xc3 and \xab:

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>> 

In the CSV file, it changes it to officiÃ«le.

If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file in your program.
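
If that program is your own Python code, the same applies: decode the file's bytes as UTF-8 when reading. A minimal sketch (Python 2, stdlib csv module), assuming the file is named test.csv as in the run further below:

import csv

# Read the UTF-8 encoded CSV that Scrapy wrote (Python 2): the csv module
# yields byte strings, so decode each cell back to unicode explicitly.
with open('test.csv', 'rb') as f:
    for row in csv.reader(f):
        cells = [cell.decode('utf-8') for cell in row]
        print u' | '.join(cells)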

Scrapy's CSV exporter writes Python Unicode strings as UTF-8 encoded bytes in the output file.
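
A minimal sketch of what effectively happens on export (Python 2), using the string from the question:

# The unicode value is encoded to UTF-8 bytes before being written to the file.
value = u'Offici\xeble bekendmakingen vandaag'
encoded = value.encode('utf-8')            # the bytes that land in the CSV
print repr(encoded)                        # 'Offici\xc3\xable bekendmakingen vandaag'
assert encoded.decode('utf-8') == value    # decoding as UTF-8 round-trips losslessly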

Scrapy selectors will output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]: 
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009

Let's see what a spider extracting these strings in items will get you as CSV:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}

Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines: 
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)

Check content of CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv 
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab

You can verify that ë is correctly encoded as the bytes c3 ab.
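
You can also check it programmatically; a small sketch (Python 2) that looks for those two bytes in the raw file:

# Confirm that the UTF-8 encoding of U+00EB (the bytes c3 ab) is in the file.
with open('test.csv', 'rb') as f:
    data = f.read()
print '\xc3\xab' in data                 # True
print repr(u'\xeb'.encode('utf-8'))      # '\xc3\xab'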

I can see the file data correctly in LibreOffice, for example (notice "Character set: Unicode (UTF-8)"):

[Screenshot: test.csv opened in LibreOffice with Character set: Unicode (UTF-8)]

You are probably using Latin-1. Here's what you get when using Latin-1 instead of UTF-8 as the input encoding (in LibreOffice again):

[Screenshot: test.csv opened in LibreOffice with Latin-1 as the input encoding, showing mojibake]
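
You can reproduce that mojibake in a Python 2 shell by decoding the UTF-8 bytes with the wrong codec; a small sketch:

# UTF-8 bytes misread as Latin-1 turn every ë into the two characters "Ã«".
utf8_bytes = u'Offici\xeble'.encode('utf-8')   # 'Offici\xc3\xable'
wrong = utf8_bytes.decode('latin-1')           # each byte becomes one character
print repr(wrong)                              # u'Offici\xc3\xable', displays as OfficiÃ«le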

To encode a string you can use encode("utf-8") directly. Something like this:

item['publicatiedatum'] = ''.join(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()).encode("utf-8")
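
As a side note on the "Non-ASCII character ... but no encoding declared" error from the question: in Python 2 that SyntaxError is raised when a source file contains a literal non-ASCII character (such as ë) without a coding declaration. Adding one as the first line of the .py file fixes it; the replace() call below is the question's attempted cleanup, shown here only as a hypothetical example:

# -*- coding: utf-8 -*-
# With the declaration above as the first line (and the file saved as UTF-8),
# non-ASCII literals no longer raise the SyntaxError:
fixed = u'OfficiÃ«le'.replace(u'Ã«', u'ë')
print fixed  # Officiële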
