I have some problems with UTF-8 encoding when I read and write a file. I have a CSV file containing Danish and Swedish letters (ÅÄÖ etc.). I want to read this file, extract a field, and manipulate the data (to create URLs).
What I am struggling with is that I get \xd6 instead of ö. I have tried the following:
# -*- coding: utf-8 -*-
Companies = codecs.open("Axel_List.csv", "r", "utf-8")
(reading the file with the codecs library), which produces this error: 'utf8' codec can't decode byte 0xe4 in position 0
url=u'http://www.proff.se/bransch-sök?q=' and url='http://www.proff.se/bransch-sök?q=' followed by url.decode('utf-8'), which produces an error when I try to join the two strings: UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 29
I can print the Company (even though they do not contain the correct letters) and the url separately, so there is something going on when I am joining them.
# -*- coding: utf-8 -*-
import re
import codecs
import os, sys
Google_urls=open('google_Urls','w')
Proff_urls=open('Proff_Urls','w')
Companies=("Company_List.csv")
for line in Companies:
    fields = line.split(",")
    if fields[10]=="Sweden":
        Company=(fields[1]).split("/v")
        Company=str(Company).replace('[',"")
        ... stripping and manipulating the records
        ...
        Company=Company.decode('utf-8')
        url='http://www.proff.se/bransch-sök?q='
        url=url.decode('utf-8')
        Proff_se= ''.join((url,Company,"\n"))
        Proff_urls.write(Company)
    else:
        continue
The reason I keep thinking something odd happens when I read the file is that I have tested the following, and it works fine:
# coding=utf-8
Svenska="äöå"
Dan_Nor="æøå"
Svenska=Svenska.decode('utf-8')
Dan_Nor=Dan_Nor.decode('utf-8')
string3 ="".join((Svenska,Dan_Nor))
print string3
Thanks in advance. I have read a lot of questions related to this, but I cannot really wrap my head around it.
The problem is almost certainly that your files aren't actually UTF-8, so trying to read them as if they were UTF-8 is failing. In particular, you claim that using codecs.open("Axel_List.csv", "r", "utf-8") and then reading the file gives you this error:
'utf8' codec can't decode byte 0xe4 in position 0
So, clearly, either it isn't really UTF-8, or it's corrupted.
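One way to confirm this without guessing is to read the file as raw bytes and ask the UTF-8 decoder where it gives up. The following is a minimal sketch, not a definitive tool; the helper name first_invalid_utf8 is made up for this illustration, and the code is meant to run on both Python 2.7 and Python 3.

```python
def first_invalid_utf8(data):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding of the byte string `data`, or None if it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte; going through
        # bytearray yields an int on both Python 2 and Python 3.
        return err.start, bytearray(data)[err.start]
    return None


# Typical use (assumes the file from the question exists):
#     with open("Axel_List.csv", "rb") as f:
#         print(first_invalid_utf8(f.read()))
```

A result of None means the bytes are valid UTF-8; a tuple such as (0, 228) points at byte 0xe4 at the very start of the file, which matches the error quoted above.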
Normally, it's hard to guess the encoding of a file without actually having the file. But in this case, it's easy.
Byte 0xe4 is ä in Latin-1 (ISO-8859-1). And ä is the first character that your code is looking for. So, your file is probably Latin-1.
The same byte is also ä in two other legacy encodings sometimes used in Scandinavia, Latin-4 and Latin-6 (ISO-8859-4 and -10), so your file could be one of these.
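To check this claim yourself, you can decode the single byte with each candidate codec. This is only a small sketch: the helper decode_with_each is a hypothetical name, and the codec names are the ones Python's standard library uses for these three character sets.

```python
# -*- coding: utf-8 -*-
# Python codec names for Latin-1, Latin-4 and Latin-6.
CANDIDATES = ("iso8859_1", "iso8859_4", "iso8859_10")


def decode_with_each(raw, encodings=CANDIDATES):
    """Map each encoding name to the text it yields for the byte string
    `raw`, or to None when the bytes are invalid in that encoding."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# All three legacy encodings turn byte 0xe4 into U+00E4 (ä).
legacy = decode_with_each(b"\xe4")
```

Since all three agree on this byte, 0xe4 alone cannot tell them apart; you would need to look at other non-ASCII characters in the file to choose between them.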
In UTF-8, 0xe4 is a lead-byte for CJK characters. Unless you suspect that you really have a corrupted Japanese text file rather than a valid Swedish one, your file is definitely not UTF-8.
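If the file really is Latin-1, the fix is to decode it as Latin-1 on the way in, work with unicode text throughout, and encode only on the way out. Here is a rough sketch under that assumption. The function name write_proff_urls and the choice of UTF-8 for the output file are mine, the column positions (1 for the company, 10 for the country) are taken from your code, and the naive split(",") is kept from your code as well, so it will not cope with quoted commas. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

# "bransch-sök" written with an escape so the source file stays pure ASCII.
BASE_URL = u"http://www.proff.se/bransch-s\xf6k?q="


def write_proff_urls(csv_path, out_path, encoding="latin-1"):
    """Read csv_path in `encoding` and write one URL per Swedish company
    to out_path as UTF-8. Returns the number of URLs written."""
    count = 0
    # io.open decodes on read and encodes on write, so everything in
    # between is unicode and no implicit ASCII conversion can happen.
    with io.open(csv_path, "r", encoding=encoding) as src, \
            io.open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip(u"\r\n").split(u",")
            # Column positions follow the question: 1 = company, 10 = country.
            if len(fields) > 10 and fields[10] == u"Sweden":
                dst.write(BASE_URL + fields[1] + u"\n")
                count += 1
    return count
```

You would call it as write_proff_urls("Company_List.csv", "Proff_Urls"). If the output looks wrong for characters other than ä, ö and å, try passing encoding="iso8859_4" or encoding="iso8859_10" instead.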