
Python: process a CSV file to remove Unicode characters greater than 3 bytes

I'm using Python 2.7.5 and trying to take an existing CSV file and process it to remove Unicode characters whose UTF-8 encoding is longer than 3 bytes. (I'm sending this to Mechanical Turk, and it's an Amazon restriction.)

I've tried to use the top (amazing) answer to this question (How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?). I assume that I can just iterate through the CSV row by row, and wherever I spot a Unicode character longer than 3 bytes, replace it with a replacement character.
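(As a sanity check of my understanding: characters that need more than 3 bytes in UTF-8 should be exactly those outside the Basic Multilingual Plane, which is what the pattern below targets. A quick interpreter test with some sample characters of my own seems to confirm this:)

# -*- coding: utf-8 -*-
# Byte lengths of a few sample characters:
print len(u'\u00e9'.encode('utf8'))      # é -> 2 bytes
print len(u'\u4e2d'.encode('utf8'))      # 中 -> 3 bytes
print len(u'\U0001F600'.encode('utf8'))  # 😀 (outside the BMP) -> 4 bytes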

# -*- coding: utf-8 -*-
import csv
import re

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
ifile  = open('sourcefile.csv', 'rU')
reader = csv.reader(ifile, dialect=csv.excel_tab)
ofile  = open('outputfile.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)

#skip header row
next(reader, None)

for row in reader:
    writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c).encode('utf8')) for c in row])

ifile.close()
ofile.close()

I'm currently getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 264: ordinal not in range(128)

So this does iterate properly through some rows, but stops when it gets to the strange unicode characters.

I'd really appreciate some pointers; I'm completely confused. I've tried replacing 'utf8' with 'latin1', and unicode(c).encode() with unicode(c).decode(), but I keep getting the same error.

Your input is still encoded bytes, not Unicode values. You need to decode to Unicode values first, but you didn't specify an encoding to use. You then need to encode back to bytes again to write to the output CSV:

writer.writerow([re_pattern.sub(u'\uFFFD', unicode(c, 'utf8')).encode('utf8')
                 for c in row])

Your error stems from the unicode(c) call; without an explicit codec to use, Python falls back to the default ASCII codec.
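You can reproduce that in isolation; with a byte string containing non-ASCII bytes (the sample value here is my own), unicode() without a codec fails, while an explicit codec works:

data = 'caf\xc3\xa9'   # UTF-8 bytes for u'café'
unicode(data, 'utf8')  # -> u'caf\xe9'
unicode(data)          # raises UnicodeDecodeError: 'ascii' codec can't
                       # decode byte 0xc3 in position 3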

If you use your file objects as context managers, there is no need to manually close them:

import csv
import re

# Matches surrogates and anything outside the Basic Multilingual Plane,
# i.e. exactly the characters that need more than 3 bytes in UTF-8.
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def limit_to_BMP(value, patt=re_pattern):
    # Decode from UTF-8, replace non-BMP characters with U+FFFD,
    # then encode back to UTF-8 bytes for the csv writer.
    return patt.sub(u'\uFFFD', unicode(value, 'utf8')).encode('utf8')

with open('sourcefile.csv', 'rU') as ifile, open('outputfile.csv', 'wb') as ofile:
    reader = csv.reader(ifile, dialect=csv.excel_tab)
    writer = csv.writer(ofile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    next(reader, None)  # skip the header row; it is not added to the output file
    writer.writerows(map(limit_to_BMP, row) for row in reader)

I moved the replacement action to a separate function too, and used a generator expression to produce all rows on demand for the writer.writerows() function.
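For what it's worth, a quick test of limit_to_BMP() on some made-up input shows the effect; note that on a narrow Python build a 4-byte character decodes to a surrogate pair, so it is replaced by two U+FFFD characters rather than one:

print limit_to_BMP('caf\xc3\xa9')       # within the BMP, passes through unchanged
print limit_to_BMP('\xf0\x9f\x98\x80')  # U+1F600 (4 bytes in UTF-8) is replaced;
                                        # a narrow build replaces each half of
                                        # the surrogate pair, giving two U+FFFDs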
