Reading a csv file (utf-8) and outputting and merging utf-8 strings

Question

I have some problems with UTF-8 encoding when I read and write a file. I have a CSV file containing Danish and Swedish letters (ÅÄÖ etc.). I want to read this file, extract a field, and manipulate the data (to create URLs).

What I am struggling with is the following:

I cannot read a file containing UTF-8 letters: Python outputs \xd6 instead of ö.
I cannot merge two strings, even though I am decoding them as UTF-8.

I have tried the following:

adding # -*- coding: utf-8 -*-
Companies = codecs.open("Axel_List.csv", "r", "utf-8") (reading the file with the codecs library), which produces this error: 'utf8' codec can't decode byte 0xe4 in position 0
url=u'http://www.proff.se/bransch-sök?q=' and url='http://www.proff.se/bransch-sök?q=' followed by url.decode('utf-8'), both of which produce the same error when I try to join the two strings:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 29

I can print the company name (even though it does not contain the correct letters) and the URL separately, so something goes wrong when I join them.

# -*- coding: utf-8 -*-
import re
import codecs
import os, sys
Google_urls=open('google_Urls','w')
Proff_urls=open('Proff_Urls','w')
Companies=open("Company_List.csv")

for line in Companies:
    fields = line.split(",")
    if fields[10]=="Sweden":
        Company=(fields[1]).split("/v")
        Company=str(Company).replace('[',"")
        ... stripping and manipulating the records
        ...
        Company=Company.decode('utf-8')
        url='http://www.proff.se/bransch-sök?q='
        url=url.decode('utf-8')
        Proff_se= ''.join((url,Company,"\n"))
        Proff_urls.write(Company)
    else:
        continue

The reason I think something weird happens when I read the file is that I have tested the following, and it works fine:

# coding=utf-8
Svenska="äöå"
Dan_Nor="æøå"
Svenska=Svenska.decode('utf-8')
Dan_Nor=Dan_Nor.decode('utf-8')
string3 ="".join((Svenska,Dan_Nor))
print string3

Thanks in advance. I have read a lot of questions related to this, but I cannot really wrap my head around it.

Answer 1

The problem is almost certainly that your files aren't actually UTF-8, so trying to read them as if they were UTF-8 is failing. In particular, you claim that using codecs.open("Axel_List.csv", "r", "utf-8") and then reading the file gives you this error:

'utf8' codec can't decode byte 0xe4 in position 0

So, clearly, either it isn't really UTF-8, or it's corrupted.
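If you want to confirm this yourself, one rough way is to read the raw bytes and try to decode them. The sketch below assumes nothing about your file; `utf8_error` is a hypothetical helper name, and the two byte strings are made-up samples, not data from your CSV.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def utf8_error(data):
    """Return None if the bytes in `data` are valid UTF-8, else the decode error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


# 0xe4 followed by a plain ASCII letter, as in Latin-1 text that starts with "än".
bad = b"\xe4n"
# 0xe4 followed by two continuation bytes: a well-formed three-byte sequence.
good = b"\xe4\xb8\xad"

err = utf8_error(bad)
print("bad sample fails at byte offset:", err.start)
print("good sample error:", utf8_error(good))
```

To run the same check on the real file, you could pass it `open("Axel_List.csv", "rb").read()`.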

Normally, it's hard to guess the encoding of a file without actually having the file. But in this case, it's easy.

Byte 0xe4 is ä in Latin-1 (ISO-8859-1). And ä is the first character that your code is looking for. So, your file is probably Latin-1.

The same byte is also ä in two other legacy encodings sometimes used in Scandinavia, Latin-4 and Latin-6 (ISO-8859-4 and -10), so your file could be one of these.
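As a quick illustration of why the single byte cannot distinguish between these candidates, here is a small sketch that decodes 0xe4 with each of them. The codec names are the ones Python ships with; the list of candidates is just the three encodings mentioned above, not an exhaustive set.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

raw = b"\xe4"

# Python codec names for Latin-1, Latin-4 and Latin-6.
candidates = ["latin-1", "iso8859_4", "iso8859_10"]

# Decode the same byte with every candidate encoding.
decoded = {name: raw.decode(name) for name in candidates}

for name in candidates:
    print(name, repr(decoded[name]))
```

All three give the same character, so you would need bytes where the encodings differ (or knowledge of where the file came from) to tell them apart.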

In UTF-8, 0xe4 is a lead-byte for CJK characters. Unless you suspect that you really have a corrupted Japanese text file rather than a valid Swedish one, your file is definitely not UTF-8.
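If the file does turn out to be Latin-1, a minimal sketch of the read-and-write side could look like the following. It assumes Latin-1 input, takes the column positions (1 for the company name, 10 for the country) and the base URL from your code, and leaves out your own stripping steps. `build_urls` and `convert` are hypothetical names, and the plain `split(",")` will not handle quoted fields that contain commas.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io

BASE_URL = "http://www.proff.se/bransch-sök?q="


def build_urls(lines, country="Sweden", name_col=1, country_col=10):
    """Build one URL per line whose country column matches `country`."""
    urls = []
    for line in lines:
        fields = line.rstrip("\r\n").split(",")
        # Skip short lines instead of failing with an IndexError.
        if len(fields) > country_col and fields[country_col] == country:
            urls.append(BASE_URL + fields[name_col])
    return urls


def convert(src, dst, src_encoding="latin-1"):
    """Decode `src` with `src_encoding` and write the URLs to `dst` as UTF-8."""
    # io.open decodes while reading, so every line is already a unicode string.
    with io.open(src, "r", encoding=src_encoding) as infile:
        urls = build_urls(infile)
    # Encoding happens once, on output, so no implicit ASCII conversion occurs.
    with io.open(dst, "w", encoding="utf-8") as outfile:
        for url in urls:
            outfile.write(url + "\n")
    return urls
```

With the file names from your question this would be called as `convert("Axel_List.csv", "Proff_Urls")`. Because everything in between is unicode, joining the URL and the company name no longer triggers the ASCII codec.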
