Python encoding issue, can't seem to figure it out

Question

Hey I am having this major issue with encoding in python. I am not too familiar with python and have been stuck on this bug for weeks. I feel like I've tried every possible thing but I can't seem to get it.

I am reading files in to work with and am getting the following error on some files that have Chinese charaters.

 'ascii' codec can't encode characters in position 10314-10316: ordinal not in range(128)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/site-packages/cc_counter-0.65-py2.7.egg/cc_counter/views.py", line 154, in reviewrequest_recent_cc
    prev_reviewrequest_ccdata = _reviewrequest_recent_cc(request, review_request_id, False, revision_offset=1)
  File "/usr/lib/python2.7/site-packages/cc_counter-0.65-py2.7.egg/cc_counter/views.py", line 140, in _reviewrequest_recent_cc
    filename, comparison_data = _download_comparison_data(request, review_request_id, revision, filediff_id, modified)
  File "/usr/lib/python2.7/site-packages/cc_counter-0.65-py2.7.egg/cc_counter/views.py", line 89, in _download_comparison_data
    revision, filediff_id, local_site, modified)
  File "/usr/lib/python2.7/site-packages/cc_counter-0.65-py2.7.egg/cc_counter/views.py", line 68, in _download_analysis
    temp_file.write(working_file)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10314-10316: ordinal not in range(128)

My code in this area looks this this:

working_file = get_original_file(filediff, request, encoding_list)

if modified:
    working_file = get_patched_file(working_file, filediff, request)

working_file = convert_to_unicode(working_file, encoding_list)[1]
logging.debug("Encoding List: %s", encoding_list)
logging.debug("Source File: " + filediff.source_file)

temp_file_name = "cctempfile_" + filediff.source_file.replace("/","_")
logging.debug("temp_file_name: " + temp_file_name)
source_file = os.path.join(HOMEFOLDER, temp_file_name)


logging.debug("File contents" + working_file)
#temp_file = codecs.open(source_file, encoding='utf-8')
#temp_file.write(working_file.encode('utf-8'))

temp_file = open(source_file, 'w')
temp_file.write(working_file)
temp_file.close()

Notice the commented out lines. Working file is never empty. The encoding from the logged "encoding list" is

Encoding List: [u'iso-8859-15']

Anything to help would be soooo appreciated. I have to take a break from this after 8 straight hours of debugging this + the previous two weeks.

Answer 1

The error indicates working_file is a Unicode string, but is being written to a file that was opened to expect a byte string. Python 2 uses the default ascii codec to implicitly convert the Unicode string to a byte string, and non-ASCII characters trigger the UnicodeEncodeError .

The commented lines are close to correct, but the write will expect Unicode strings with codecs.open , so no need to explicitly encode, and the file needs to be opened for writing:

temp_file = codecs.open(source_file, 'w', encoding='utf-8')
temp_file.write(working_file)

Answer 2

What is the return type of your convert_to_unicode function?

If it is bytes, you probably should change temp_file = open(source_file, 'w') to temp_file = open(source_file, 'wb') , which means writing bytes into file.

Python encoding issue, can't seem to figure it out

Question

2 answers

solution1
1 2015-12-08 03:07:47

solution2
0 2015-12-08 02:41:28

Python encoding issue, can't seem to figure it out

Question

2 answers

solution1 1 2015-12-08 03:07:47

solution2 0 2015-12-08 02:41:28

solution1
1 2015-12-08 03:07:47

solution2
0 2015-12-08 02:41:28