
How to ignore Unicode whitespace characters encoded in UTF-8?

I have a csv file with the following information:

id  name    age     height  weight
1   x       12      11      124
2   y       13      23      432
3   z       14      43      1435

It's stored in a file called Workbook2.csv. I use the following code:

import csv

ipFile = csv.DictReader(open('Workbook2.csv', 'rU'))
rows = {}  # Trying to collect the rows into this dictionary.
for row in ipFile:
    print row

I get the following result:

{'weight': '124', '\xef\xbb\xbfid': '1', 'height ': '11', 'age   ': '12', 'name ': 'x'}
{'weight': '432', '\xef\xbb\xbfid': '2', 'height ': '23', 'age   ': '13', 'name ': 'y'}
{'weight': '1435', '\xef\xbb\xbfid': '3', 'height ': '43', 'age   ': '14', 'name ': 'z'}

I would like to know how I can collect this output into a dictionary. I would also like to know how I can ignore the Unicode characters that are encoded using UTF-8, if there is a filter I can use to eliminate them.

Your input data contains UTF-8 BOM sequences on every single line. Whatever produced this file appears to have appended data one line at a time using the utf-8-sig codec (or a non-Python equivalent). The BOM, if used at all, should be the first character in the file and appear nowhere else. Your data is broken; if you can possibly fix this at the source, do so.

However, there is a way to repair this as you read. The 'file' that the csv module reads can be anything that produces lines as you iterate over it. Use a generator to filter the file lines first:

from codecs import BOM_UTF8

def bom_filter(lines):
    for line in lines:
        if line.startswith(BOM_UTF8):
            line = line[len(BOM_UTF8):]
        yield line

Then pass your file through the filter before handing it to the DictReader() object:

with open('Workbook2.csv', 'rU') as inputfile:
    ipFile = csv.DictReader(bom_filter(inputfile))

Demo:

>>> from io import BytesIO
>>> import csv
>>> from codecs import BOM_UTF8
>>> def bom_filter(lines):
...     for line in lines:
...         if line.startswith(BOM_UTF8):
...             line = line[len(BOM_UTF8):]
...         yield line
...
>>> demofile = BytesIO('''\
... \xef\xbb\xbfid,name,age,height,weight
... \xef\xbb\xbf1,x,12,11,124
... \xef\xbb\xbf2,y,13,23,432
... \xef\xbb\xbf3,z,14,43,1435
... ''')
>>> ipFile = csv.DictReader(bom_filter(demofile))
>>> for row in ipFile:
...     print row
...
{'age': '12', 'height': '11', 'id': '1', 'weight': '124', 'name': 'x'}
{'age': '13', 'height': '23', 'id': '2', 'weight': '432', 'name': 'y'}
{'age': '14', 'height': '43', 'id': '3', 'weight': '1435', 'name': 'z'}

In Python 3, the csv module takes Unicode string input (as opposed to byte strings), so you now need to look for the decoded result: the U+FEFF codepoint (the byte order mark, formerly named ZERO WIDTH NO-BREAK SPACE). To make the code work on either Python version, swap out what you test for at the start of each line:

import sys
to_filter = u'\ufeff'
if sys.version_info < (3,):
    to_filter = to_filter.encode('utf8')

def bom_filter(lines):
    for line in lines:
        if line.startswith(to_filter):
            line = line[len(to_filter):]
        yield line
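As a quick check of the version-agnostic filter above (this sketch assumes Python 3, where the lines arrive as already-decoded text), the U+FEFF codepoint is stripped from each line before the csv module ever sees it:

```python
import csv
import io

to_filter = '\ufeff'  # on Python 3 the BOM arrives as a decoded codepoint

def bom_filter(lines):
    # Drop a leading BOM from every line before handing it to csv
    for line in lines:
        if line.startswith(to_filter):
            line = line[len(to_filter):]
        yield line

# Hypothetical in-memory file with a BOM on every line, mimicking the question
demofile = io.StringIO(
    '\ufeffid,name,age\n'
    '\ufeff1,x,12\n'
)
rows = list(csv.DictReader(bom_filter(demofile)))
print(rows[0]['id'])  # the key is a clean 'id', not '\ufeffid'
```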

There is a skipinitialspace keyword argument, but I verified from the C code that it only skips the literal ' ' (space) character.

Two possibilities:

  1. Subclass DictReader to add some code to strip out the spaces
  2. Massage the lines from your file before passing them into the DictReader.

An example of (2) would be:

import csv
import io

with io.open('Workbook2.csv', 'r', encoding='utf8') as infile:
    # Replace the BOM codepoint wherever it appears in the decoded lines
    ipFile = csv.DictReader((x.replace(u"\uFEFF", u" ") for x in infile))
    # ...
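A minimal sketch of option (1), assuming Python 3: subclass DictReader (the subclass name StrippingDictReader is made up here) and override the fieldnames property so every header cell is stripped of a leading BOM and surrounding whitespace before it becomes a dictionary key:

```python
import csv
import io

class StrippingDictReader(csv.DictReader):
    # Hypothetical subclass: normalise header cells as they are read
    @property
    def fieldnames(self):
        names = csv.DictReader.fieldnames.fget(self)
        if names is None:
            return None
        return [name.lstrip('\ufeff').strip() for name in names]

# Hypothetical input with a BOM and padded headers, like the question's output
demofile = io.StringIO('\ufeffid ,name  ,age \n1,x,12\n')
rows = list(StrippingDictReader(demofile))
print(rows[0])  # → {'id': '1', 'name': 'x', 'age': '12'}
```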

I think the output here is being misinterpreted.

DictReader takes the fieldnames from the first line, and the first column name (the one right after the invisible BOM) is just "id". That is why the id field appears with the BOM prepended in every record.

In Python 2.7 and 3.6, I had to use the csv.excel_tab dialect so that the tabs are interpreted as delimiters.

Your input data/CSV file is absolutely fine, as there is just one BOM at the beginning (where it is allowed to be). You just need to strip the BOM before reading.

Eg like this:

import csv
from codecs import BOM_UTF8

csv_file = open('test2.csv', 'rU')
csv_file.seek(len(BOM_UTF8))  # skip past the 3-byte UTF-8 BOM
ipFile = csv.DictReader(csv_file, dialect=csv.excel_tab)
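On Python 3 the same effect can be had without manual seeking, assuming the BOM really does occur only at the start of the file: the utf-8-sig codec consumes a single leading BOM while decoding. A sketch (using an in-memory byte stream in place of the file):

```python
import csv
import io

# Hypothetical tab-separated data with one BOM at the very start
raw = '\ufeffid\tname\tage\n1\tx\t12\n'.encode('utf-8')

# utf-8-sig strips one leading BOM, if present, during decoding
csv_file = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig', newline='')
rows = list(csv.DictReader(csv_file, dialect=csv.excel_tab))
print(rows[0])  # → {'id': '1', 'name': 'x', 'age': '12'}
```

For a real file, `open('test2.csv', newline='', encoding='utf-8-sig')` would do the same job.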
