
Scrapy exporting weird symbols into csv file

Ok, so here's the issue. I'm a beginner who has just started to delve into Scrapy/Python.

I use the code below to scrape a website and save the results into a csv. When I look in the command prompt, it turns words like Officiële into Offici\xeble. In the csv file, it changes it to officiÃ«le. I think this is because it's saving in unicode instead of UTF-8? I however have 0 clue how to change my code, and I've been trying all morning so far.

Could anyone help me out here? I'm specifically looking at making sure item["publicatietype"] works properly. How can I encode/decode it? What do I need to write? I tried using replace('Ã«', 'ë'), but that gives me an error (non-ASCII character, but no encoding declared).
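(As an aside, the "non-ASCII character, but no encoding declared" error is Python 2 refusing to compile a source file that contains a literal ë without a declared source encoding. Per PEP 263, putting this line at the very top of the .py file makes that particular error go away:)

# -*- coding: utf-8 -*-
# must be on the first or second line of the file (PEP 263)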

# Imports assumed from context (the original snippet omitted them);
# ThingsToGather is the asker's Item subclass -- the module path below is hypothetical.
import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem
from myproject.items import ThingsToGather


class pagespider(Spider):
    name = "OBSpider"
    #max_pages is here to prevent endless loops; make it as large as you need. The spider will try every page
    #up to that number even if there's nothing there, so a number that is too high just wastes time and yields no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback = self.parse)


    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']

    #This loads some general info from the header. If the string is less than 5 characters, the page is probably
    #a faulty link (i.e. a 404 error). In that case the item is dropped; otherwise processing continues.

            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item = self.__normalise_item(item, response.url)

    #If the string is shorter than 5 characters, the required data is not on this page and has to be
    #retrieved from the technical-information link. Otherwise the item is complete and is yielded directly.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request 
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)    
            return item

    # The methods below clean up strings. Every item is sent through __normalise_item to strip unwanted
    # characters (strip) and collapse double spaces (split/join).

    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])

        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value

ANSWER:

See the comment by paul trmbrth below. The problem is not Scrapy, it's Excel.

For anyone coming across this question as well, the tl;dr is: import the data in Excel (via the Data menu in the ribbon) and switch the encoding from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).
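If the file has to open correctly in Excel without going through the import dialog, a common workaround is to prepend a UTF-8 byte-order mark, which Excel uses to auto-detect the encoding. A minimal sketch (the filenames are made up, and it assumes the CSV is already UTF-8, which is what Scrapy's exporter produces):

# Sketch: prepend a UTF-8 BOM so Excel auto-detects the encoding on double-click.
# 'test.csv' and 'test_excel.csv' are hypothetical filenames.
import codecs

with open('test.csv', 'rb') as f:
    data = f.read()

if not data.startswith(codecs.BOM_UTF8):   # don't add the BOM twice
    with open('test_excel.csv', 'wb') as f:
        f.write(codecs.BOM_UTF8 + data)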

Officiële will be represented as u'Offici\xeble' in Python 2, as seen in the Python shell session example below (no need to worry about the \xXX characters, it's just how Python represents non-ASCII Unicode characters):

$ python
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>> 

I think this is because it's saving in unicode instead of UTF-8

UTF-8 is an encoding, Unicode is not.

ë, aka U+00EB, aka LATIN SMALL LETTER E WITH DIAERESIS, will be UTF-8 encoded as 2 bytes, \xc3 and \xab:

>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>> 

In the csv file, it changes it to officiÃ«le.

If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file inside your program.
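You can reproduce the effect in a Python 2 shell by encoding to UTF-8 and then decoding with the wrong codec (a minimal demonstration of the mojibake, not something from the original answer):

>>> utf8_bytes = u'Officiële'.encode('utf-8')   # the bytes 'Offici\xc3\xable'
>>> print utf8_bytes.decode('latin-1')          # wrong codec -> mojibake
OfficiÃ«le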

Scrapy's CSV exporter will write Python Unicode strings as UTF-8 encoded strings in the output file.

Scrapy selectors will output Unicode strings:

$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]: 
[u'Offici\xeble bekendmakingen vandaag',
 u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
 u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009']

In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009

Let's see what a spider extracting these strings in items will get you as CSV:

$ cat testspider.py
import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}

Run the spider and ask for CSV output:

$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines: 
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12018,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)

Check the content of the CSV file:

$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
            nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv 
00000000  6c 69 6e 6b 0d 0a 4f 66  66 69 63 69 c3 ab 6c 65  |link..Offici..le|
00000010  20 62 65 6b 65 6e 64 6d  61 6b 69 6e 67 65 6e 20  | bekendmakingen |
00000020  76 61 6e 64 61 61 67 0d  0a 55 69 74 6c 65 67 20  |vandaag..Uitleg |
00000030  6e 69 65 75 77 65 20 6e  75 6d 6d 65 72 69 6e 67  |nieuwe nummering|
00000040  20 48 61 6e 64 65 6c 69  6e 67 65 6e 20 76 61 6e  | Handelingen van|
00000050  61 66 20 31 20 6a 61 6e  75 61 72 69 20 32 30 31  |af 1 januari 201|
00000060  31 0d 0a 22 55 69 74 6c  65 67 20 6e 69 65 75 77  |1.."Uitleg nieuw|
00000070  65 0d 0a 20 20 20 20 20  20 20 20 20 20 20 20 6e  |e..            n|
00000080  75 6d 6d 65 72 69 6e 67  20 53 74 61 61 74 73 63  |ummering Staatsc|
00000090  6f 75 72 61 6e 74 20 76  61 6e 61 66 20 31 20 6a  |ourant vanaf 1 j|
000000a0  75 6c 69 20 32 30 30 39  22 0d 0a                 |uli 2009"..|
000000ab

You can verify that ë is correctly encoded as c3 ab.

I can see the file data correctly when using LibreOffice for example (notice "Character set: Unicode (UTF-8)"):

(screenshot: test.csv opened in LibreOffice's import dialog)

You are probably using Latin-1. Here's what you get when using Latin-1 instead of UTF-8 as the input encoding setting (in LibreOffice again):

(screenshot: the same file imported as Latin-1, showing the garbled characters)

To encode a string you can directly use encode("utf-8"). Something like this:

item['publicatiedatum'] = ''.join(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()).encode("utf-8")
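As a side note, newer Scrapy releases also expose a feed-level setting, so you don't have to encode field by field; if your Scrapy version supports it (it was added around Scrapy 1.2, so check your version's docs), this in settings.py is enough:

# settings.py -- feed-level alternative (assumes Scrapy 1.2 or later)
FEED_EXPORT_ENCODING = 'utf-8'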
