How do I read in a CSV file that contains pound symbols?

My file has a NUL byte at the beginning, and I am struggling with the "£" symbol.

data_initial = codecs.open(filename, "rU", "utf-16")
data = csv.DictReader((line.replace('\x00','') for line in data_initial), delimiter="\t")
for row in data:
    print row

I get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 169: ordinal not in range(128)

BTW: it doesn't matter if I try to print this line out or not. I can print just '1' and the error remains the same. I do not know why it says it's an encoding error when it's probably a decoding error.

In any case, how can I deal with the problem?

The problem is almost certainly that codecs.open(filename, "rU", "utf-16") is converting the "£" symbol in a way that's incompatible with csv. The Python 2 documentation for the csv module notes:

This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

Simply changing the encoding to "utf-8" (assuming the file contains no symbols that are incompatible with it) should fix the problem: codecs.open(filename, "rU", "utf-8")
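As a side note on why the traceback says "encode" rather than "decode": Python 2's csv works on byte strings, so a unicode line handed to it is most likely being converted implicitly with the default ASCII codec. The sketch below reproduces the same class of error explicitly. It is written for Python 3 and is only an illustration; the variable names are made up for the example.

```python
# Python 3 sketch: encoding the pound sign as ASCII fails, as UTF-8 it works.
pound = "\u00a3"  # the pound sign, U+00A3

try:
    pound.encode("ascii")
    message = ""
except UnicodeEncodeError as exc:
    # Same error class and wording as in the question's traceback.
    message = str(exc)
    print(message)

# UTF-8 can represent the character; it takes two bytes.
utf8_bytes = pound.encode("utf-8")
print(utf8_bytes)
```

This is why the error appears even when nothing is printed: the failure happens while the row is being read, not while it is being displayed.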

I will assume that you are using Python 2.7 here. Under Python 2.7, the csv module has no support for Python unicode strings. You must read the file in as raw bytes, and then decode the strings after csv returns them. You cannot decode the file as you are reading it and expect csv to deal with it; in my experience it won't cope.

This is very different in Python 3.x, where csv does support Unicode and you must do the decoding before csv reads the data, or it won't work.
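For the Python 3 case, a minimal sketch might look like the following. It assumes the file really is UTF-16 and tab-delimited, as in the question; the function name read_utf16_tsv and the sample data are hypothetical and exist only for this example.

```python
import csv
import os
import tempfile


def read_utf16_tsv(path):
    """Return the rows of a UTF-16, tab-delimited file as a list of dicts."""
    # Decoding happens in open(); csv then only ever sees text.
    # newline="" is what the csv documentation recommends for file objects.
    with open(path, encoding="utf-16", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Build a small sample file (made-up data) to exercise the function.
fd, sample_path = tempfile.mkstemp(suffix=".tsv")
os.close(fd)
with open(sample_path, "w", encoding="utf-16", newline="") as handle:
    handle.write("item\tprice\r\nTea\t\u00a32.50\r\n")

rows = read_utf16_tsv(sample_path)
os.remove(sample_path)
print(rows)
```

With the decoding done by open(), there should be no need to strip '\x00' characters by hand, provided the declared encoding matches the file.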

It is kind of annoying that there is such a large difference between the two cases.

My oldish but tested code, written to work with all Python versions, looks like this (I suspect you can replace "ascii" with whatever encoding you need). I have just noticed that some of the asserts are somewhat pointless, but I am quoting the original tested code here regardless.

import csv
import sys

if sys.version_info < (3, 0):
    # Python 2: the csv module does not support unicode, so we must use byte strings.

    def _input_csv(csv_data):
        for line in csv_data:
            assert isinstance(line, bytes)
            yield line

    def _output_csv(csv_line):
        for i, column in enumerate(csv_line):
            csv_line[i] = column.decode("ascii", errors='ignore')
            assert isinstance(csv_line[i], unicode)  # NOQA

else:
    # Python3: csv module does support unicode, we must use strings everywhere, 
    # not byte strings

    def _input_csv(unicode_csv_data):
        for line in unicode_csv_data:
            assert isinstance(line, bytes)
            line = line.decode("ascii", errors='ignore')
            assert isinstance(line, str)
            yield line

    def _output_csv(csv_line):
        for column in csv_line:
            assert isinstance(column, str)

And where I do the read (in this case from a subprocess):

reader = csv.reader(_input_csv(process.stdout), delimiter="|")
for row in reader:
    _output_csv(row)
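If only Python 3 needs to be supported, the same idea can probably be expressed more briefly by wrapping the binary stream in io.TextIOWrapper, so that csv receives decoded text. This is a sketch under that assumption, not the original tested code: read_rows is a hypothetical helper, and io.BytesIO stands in for process.stdout.

```python
import csv
import io


def read_rows(binary_stream, encoding="utf-8", delimiter="|"):
    """Decode a binary stream on the fly and parse it with csv.reader."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text_stream, delimiter=delimiter))


# io.BytesIO plays the role of a subprocess's stdout in this example.
fake_stdout = io.BytesIO("name|price\nTea|\u00a32.50\n".encode("utf-8"))
rows = read_rows(fake_stdout)
print(rows)
```

Unlike decoding with "ascii" and errors='ignore', which silently drops characters such as "£", naming the real encoding here keeps them intact.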
