Reading a CSV file (UTF-8) and outputting and merging UTF-8 strings

I have some problems with UTF-8 encoding when I read and write a file. I have a CSV file containing Danish and Swedish letters (ÅÄÖ etc.). I want to read this file, extract a field, and manipulate the data (to create URLs).

What I am struggling with is the following:

  • I cannot read a file containing UTF-8 letters: Python outputs \xd6 instead of ö.
  • I cannot merge two strings even though I am decoding them as UTF-8.

I have tried the following:

  • adding # -*- coding: utf-8 -*-
  • Companies = codecs.open("Axel_List.csv", "r", "utf-8") (reading the file with the codecs library), which produces this error: 'utf' codec can't byte 0xe4 in position 0
  • url=u'http://www.proff.se/bransch-sök?q=' and url='http://www.proff.se/bransch-sök?q=' followed by url.decode('utf-8'), both of which produce the same error when I try to join the two strings:
    UnicodeEncodeError 'ascii codec can't encode character u'\xf6 in position 29

I can print the Company (even though it does not contain the correct letters) and the url separately, so something goes wrong when I join them.

# -*- coding: utf-8 -*-
import re
import codecs
import os, sys
Google_urls=open('google_Urls','w')
Proff_urls=open('Proff_Urls','w')
Companies=("Company_List.csv")

for line in Companies:
    fields = line.split(",")
    if fields[10]=="Sweden":
        Company=(fields[1]).split("/v")
        Company=str(Company).replace('[',"")
        # ... stripping and manipulating the records
        # ...
        Company=Company.decode('utf-8')
        url='http://www.proff.se/bransch-sök?q='
        url=url.decode('utf-8')
        Proff_se= ''.join((url,Company,"\n"))
        Proff_urls.write(Company)
    else:
        continue

The reason I think something odd happens when I read the file is that I have tested the following, and it works fine.

# coding=utf-8
Svenska="äöå"
Dan_Nor="æøå"
Svenska=Svenska.decode('utf-8')
Dan_Nor=Dan_Nor.decode('utf-8')
string3 ="".join((Svenska,Dan_Nor))
print string3

Thanks in advance. I have read a lot of questions related to this, but I cannot really wrap my head around it.

The problem is almost certainly that your files aren't actually UTF-8, so trying to read them as if they were UTF-8 fails. In particular, you say that using codecs.open("Axel_List.csv", "r", "utf-8") and then reading the file gives you this error:

'utf' codec can't byte 0xe4 in position 0   

So, clearly, either it isn't really UTF-8, or it's corrupted.
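One way to confirm this without opening the file in an editor is to attempt a strict UTF-8 decode of the raw bytes and report where it fails. The following is a minimal sketch, assuming Python 3; the helper name `utf8_error` is hypothetical and not part of the question's code.

```python
def utf8_error(data):
    """Return (offset, offending_byte) for the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.start + 1]
    return None


# 0xe4 followed by an ASCII letter is what a Latin-1 "är..." looks like on disk.
print(utf8_error(b"\xe4rlig"))
# The same word encoded as real UTF-8 decodes cleanly.
print(utf8_error("ärlig".encode("utf-8")))
```

To apply it to a real file, read the bytes with `open(path, "rb").read()` and pass them in.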

Normally, it's hard to guess the encoding of a file without actually having the file. But in this case, it's easy.

Byte 0xe4 is ä in Latin-1 (ISO-8859-1). And ä is the first character that your code is looking for. So, your file is probably Latin-1.
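If the file is indeed Latin-1, one approach is to decode it as Latin-1 on read and encode as UTF-8 on write, so that only text strings are joined in between. The sketch below assumes Python 3. The function name `write_proff_urls` and its parameters are hypothetical; the column positions (name in field 1, country in field 10) and the base URL are taken from the question.

```python
import csv
import io

BASE_URL = u"http://www.proff.se/bransch-sök?q="  # base URL from the question


def write_proff_urls(src_path, dst_path, src_encoding="latin-1"):
    """Read a legacy-encoded CSV and write one UTF-8 URL line per Swedish company.

    Returns the number of URLs written.
    """
    count = 0
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for fields in csv.reader(src):
            # Column layout assumed from the question: name in 1, country in 10.
            if len(fields) > 10 and fields[10] == "Sweden":
                # Both operands are text, so no implicit ASCII conversion occurs.
                dst.write(BASE_URL + fields[1] + u"\n")
                count += 1
    return count
```

The company name is appended as-is, as in the question; percent-encoding the query string would be a separate step.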

The same byte is also ä in two other legacy encodings sometimes used in Scandinavia, Latin-4 and Latin-6 (ISO-8859-4 and -10), so your file could be one of these.
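To see whether the choice among these three encodings matters for a given piece of data, the same bytes can be decoded with each candidate and the results compared. This is a rough sketch, assuming Python 3 and its standard codec names `iso8859_1`, `iso8859_4`, and `iso8859_10`; the helper names are hypothetical.

```python
CANDIDATES = ("iso8859_1", "iso8859_4", "iso8859_10")  # Latin-1, Latin-4, Latin-6


def decodings(data, codecs=CANDIDATES):
    """Decode the same bytes with every candidate codec."""
    return {name: data.decode(name) for name in codecs}


def all_agree(data, codecs=CANDIDATES):
    """True when every candidate codec yields the same text for these bytes."""
    return len(set(decodings(data, codecs).values())) == 1


# The Swedish letters from the question decode identically in all three.
print(decodings(b"\xe4\xf6\xe5"))
print(all_agree(b"\xe4\xf6\xe5"))
# Other byte values can differ, which is where the choice starts to matter.
print(all_agree(b"\xe8"))
```

If `all_agree` returns True for the whole file, decoding as Latin-1 gives the same text as the other two.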

In UTF-8, 0xe4 is a lead byte of a three-byte sequence (code points U+4000 to U+4FFF, mostly CJK characters). Unless you suspect that you really have a corrupted Japanese text file rather than a valid Swedish one, your file is definitely not UTF-8.
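The range covered by a three-byte lead byte follows from the UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx): the low four bits of the lead byte become the top four bits of the code point. A small sketch of that arithmetic, assuming Python 3 and using a hypothetical helper name:

```python
def three_byte_lead_range(lead):
    """Return (first, last) code points whose UTF-8 form starts with this lead byte.

    Only handles three-byte lead bytes (0xE0..0xEF). It ignores the extra
    restrictions on 0xE0 (overlong forms) and 0xED (surrogates).
    """
    if not 0xE0 <= lead <= 0xEF:
        raise ValueError("not a three-byte UTF-8 lead byte")
    high = (lead & 0x0F) << 12  # low nibble supplies bits 12..15 of the code point
    return high, high | 0x0FFF


first, last = three_byte_lead_range(0xE4)
print(hex(first), hex(last))

# A valid sequence starting with 0xe4: U+4E2D, a CJK ideograph.
print(b"\xe4\xb8\xad".decode("utf-8"))
```

A 0xe4 byte followed by an ASCII letter, as in Swedish Latin-1 text, has no continuation bytes and therefore cannot be valid UTF-8.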
