Scrapy exporting weird symbols into csv file
OK, here's the problem. I'm a beginner just starting to dig into scrapy/python.
I use the code below to scrape a website and save the results to a csv. When I look at the command prompt, it turns words like Officiële into Offici\xeble. In the csv file, it changes it to officiéle. I think this is because it saves in unicode instead of UTF-8? I have zero clue how to change my code though, and I've been trying all morning so far.
Can anyone help me? I'm looking in particular at making sure item["publicatietype"] works properly. How do I encode/decode it? What do I need to write? I tried using replace('ë', 'ë'), but that gives me an error (non-ASCII character, but no encoding declared).
import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem
# ThingsToGather is this project's scrapy.Item subclass, imported from its items module.

class pagespider(Spider):
    name = "OBSpider"
    # max_pages is put here to prevent endless loops; make it as large as you need.
    # It will try and go up to that page even if there's nothing there.
    # A number too high will just take way too much time and yield no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i+1), callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            # This loads some general info from the header. If this string is less than
            # 5 characters, the page is probably a faulty link (i.e. an error 404).
            # In that case the item is dropped; otherwise it continues.
            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item = self.__normalise_item(item, response.url)
                # If the string is less than 5 characters, the required data is not on the
                # page and has to be retrieved from the technical information link.
                # If it is the proper link (the else clause), the item is done and yielded.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item["filename"] = sel.xpath('//span[contains(@property, "http://standaarden.overheid.nl/oep/meta/publicationName")]/text()').extract()
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)
            return item

    # The methods below clean up strings. Everything is sent through __normalise_item
    # to remove unwanted characters (strip) and double spaces (split).
    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])
        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value
Answer:
See paul trmbrth's comment below. The problem is not scrapy but Excel.
For anyone running into this, the tl;dr is: import the data in Excel (in the Data menu on the ribbon) and change Windows (ANSI), or whatever it opens with, to Unicode (UTF-8).
Officiële will be represented as u'Offici\xeble' in Python 2, as in the example python shell session below (no need to worry about the \xXX characters, that is just how Python represents non-ASCII Unicode characters).
$ python
Python 2.7.9 (default, Apr 2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>>
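For comparison (this is not from the original answer): in Python 3, str is Unicode by default, and the repr shows the printable character directly rather than an escape. A minimal sketch:

```python
# Python 3: str literals are Unicode, so '\xeb' and 'ë' spell the same character.
s1 = 'Officiële'
s2 = 'Offici\xeble'   # \xeb is U+00EB, LATIN SMALL LETTER E WITH DIAERESIS
print(s1 == s2)       # the two literals are the same string
print(repr(s1))       # Python 3 shows the character itself: 'Officiële'
```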
"I think this is because it saves in unicode instead of UTF-8"
UTF-8 is an encoding; Unicode is not.
ë, a.k.a. U+00EB, a.k.a. LATIN SMALL LETTER E WITH DIAERESIS, will be encoded by UTF-8 as the 2 bytes \xc3 and \xab:
>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>>
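The same round trip can be checked in Python 3, where bytes and str are distinct types (a sketch, not part of the original answer): encoding produces the two bytes 0xC3 0xAB, and decoding them with UTF-8 restores the original string losslessly.

```python
# Encoding turns U+00EB into the two bytes 0xC3 0xAB;
# decoding those bytes back with UTF-8 restores the string exactly.
encoded = 'Officiële'.encode('utf-8')
decoded = encoded.decode('utf-8')
print(encoded)                  # b'Offici\xc3\xable'
print(decoded == 'Officiële')   # True
```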
"In the csv file, it changes it to officiéle."
If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file in your program.
The Scrapy CSV exporter writes Python Unicode strings to the output file as UTF-8 encoded strings.
Scrapy selectors output Unicode strings:
$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]:
[u'Offici\xeble bekendmakingen vandaag',
u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
u'Uitleg nieuwe\r\n nummering Staatscourant vanaf 1 juli 2009']
In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009
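Note the embedded \r\n plus indentation inside the last string. Collapsing runs of whitespace the way the question's __normalise helper does can be sketched as a standalone function (an illustration, not the asker's exact code):

```python
def normalise(value):
    # Join list results into one string, then collapse every run of
    # whitespace (including \r\n and leading spaces) into single spaces.
    if isinstance(value, list):
        value = ' '.join(value)
    return ' '.join(value.split())

print(normalise('Uitleg nieuwe\r\n            nummering Staatscourant vanaf 1 juli 2009'))
```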
Let's see what CSV a spider extracting these strings into items would get you:
$ cat testspider.py
import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}
Running the spider and asking for CSV output:
$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines:
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12018,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
'item_scraped_count': 3,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)
Checking the content of the CSV file:
$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv
00000000 6c 69 6e 6b 0d 0a 4f 66 66 69 63 69 c3 ab 6c 65 |link..Offici..le|
00000010 20 62 65 6b 65 6e 64 6d 61 6b 69 6e 67 65 6e 20 | bekendmakingen |
00000020 76 61 6e 64 61 61 67 0d 0a 55 69 74 6c 65 67 20 |vandaag..Uitleg |
00000030 6e 69 65 75 77 65 20 6e 75 6d 6d 65 72 69 6e 67 |nieuwe nummering|
00000040 20 48 61 6e 64 65 6c 69 6e 67 65 6e 20 76 61 6e | Handelingen van|
00000050 61 66 20 31 20 6a 61 6e 75 61 72 69 20 32 30 31 |af 1 januari 201|
00000060 31 0d 0a 22 55 69 74 6c 65 67 20 6e 69 65 75 77 |1.."Uitleg nieuw|
00000070 65 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 6e |e.. n|
00000080 75 6d 6d 65 72 69 6e 67 20 53 74 61 61 74 73 63 |ummering Staatsc|
00000090 6f 75 72 61 6e 74 20 76 61 6e 61 66 20 31 20 6a |ourant vanaf 1 j|
000000a0 75 6c 69 20 32 30 30 39 22 0d 0a |uli 2009"..|
000000ab
You can verify that ë is correctly encoded as c3 ab.
For example, I can view the file data correctly when opening it with LibreOffice (note "Character set: Unicode UTF-8").
You are probably using Latin-1. Here is what you would get when using Latin-1 instead of UTF-8 as the input encoding setting (again in LibreOffice):
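The garbling itself is easy to reproduce: when UTF-8 bytes are decoded with a wrong single-byte codec, each 2-byte character becomes two junk characters. A sketch, assuming Latin-1 as the wrong codec:

```python
# 'ë' is the two bytes C3 AB in UTF-8; decoded as Latin-1, each byte
# becomes its own character, producing the familiar 'Ã«' mojibake.
utf8_bytes = 'Officiële'.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)   # OfficiÃ«le
```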
To encode strings, you can use encode("utf-8") directly. Something like this:
item['publicatiedatum'] = ''.join(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()).encode("utf-8")
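If re-importing the file in Excel every time is a pain: newer Scrapy versions also support a FEED_EXPORT_ENCODING setting, and setting it to 'utf-8-sig' makes the CSV start with a UTF-8 byte order mark that Excel auto-detects. The BOM trick itself is plain Python (a sketch using the stdlib csv module, not the asker's spider):

```python
import csv
import io

# Write a CSV through the 'utf-8-sig' codec: it prefixes the output with
# a UTF-8 BOM, which Excel uses to pick UTF-8 instead of its ANSI default.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding='utf-8-sig', newline='')
csv.writer(wrapper).writerow(['Officiële bekendmakingen vandaag'])
wrapper.flush()
data = buf.getvalue()
print(data[:3])   # b'\xef\xbb\xbf' -- the UTF-8 byte order mark
```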