Python Unicode error, 'ascii' codec can't encode character

Question

I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 587: ordinal not in range(128)

My code:

import os
from bs4 import BeautifulSoup

# Raw strings keep the backslashes in Windows paths literal.
do = dir_with_original_files = r'C:\Users\Me\Directory'
dm = dir_with_modified_files = r'C:\Users\Me\Directory\New'
for root, dirs, files in os.walk(do):
    for f in files:
        if f.endswith('~'): #you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        mf = f.split('.')
        mf = ''.join(mf[:-1])+'_mod.'+mf[-1] # you can keep the same name 
                                             # if you omit the last two lines.
                                             # They are in separate directories
                                             # anyway. In that case, mf = f
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            soup = BeautifulSoup(orig_f.read())
            for t in soup.find_all('td', class_='test'):
                t.string.wrap(soup.new_tag('h2'))
            # This is where you create your new modified file.
            modi_f.write(soup.prettify())

This code iterates over a directory and, for each file, finds every td of class test and wraps the text inside it in h2 tags. So a cell that was originally:

<td class="test"> text </td>

would, after running this program, appear in a newly created file as:

<td class="test"> <h2>text</h2> </td>

At least, that is how I would like it to work. Instead, I get the error described above. I believe this is because some of the text I am parsing is in Spanish and contains accented characters.

What can I do to fix my issue?

Answer 1

soup.prettify() returns a Unicode string, but your file expects a byte string. Python tries to help here and automatically encodes the result with the default ASCII codec, but your Unicode string contains code points outside the ASCII range, so the encoding fails.
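To make the failure concrete, here is a minimal sketch that reproduces the same encoding step outside of BeautifulSoup. The helper name try_encode is hypothetical and only serves the illustration; the character is the u'\xe1' from the traceback.

```python
# The character from the traceback: LATIN SMALL LETTER A WITH ACUTE.
text = u'\xe1'


def try_encode(value, codec):
    """Return the encoded bytes, or None if the codec cannot represent value.

    Hypothetical helper, used here only to show which codecs succeed.
    """
    try:
        return value.encode(codec)
    except UnicodeEncodeError:
        return None


# ASCII only covers code points 0-127, so this is the step that fails.
ascii_result = try_encode(text, 'ascii')

# A codec that covers the character succeeds.
utf8_result = try_encode(text, 'utf-8')
```

The ASCII attempt yields None because the codec raises UnicodeEncodeError, which is what happens implicitly when the Unicode string is written to the file.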

You'll have to either manually encode to a different codec, or use a different file object type that'll do this automatically for you.
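As a sketch of the second option, io.open accepts an encoding argument and encodes Unicode text on write, on both Python 2 and Python 3. The function name write_text, the temporary demo path, and the choice of UTF-8 are assumptions for illustration; pick whatever encoding your files actually need.

```python
import io
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    """Write Unicode text to path, letting the file object do the encoding.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    with io.open(path, 'w', encoding=encoding) as fh:
        fh.write(text)


# Write a small snippet containing an accented character to a scratch file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.html')
write_text(demo_path, u'<td class="test">\xe1rbol</td>')

# Read the file back as raw bytes to see what was stored on disk.
with io.open(demo_path, 'rb') as fh:
    raw = fh.read()
```

In the question's loop, this would roughly correspond to opening modi_f with io.open(modified_file, 'w', encoding=...) and writing soup.prettify() unchanged.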

In this case, I'd encode to the original encoding that BeautifulSoup detected for you:

modi_f.write(soup.prettify().encode(soup.original_encoding))

soup.original_encoding reflects the encoding BeautifulSoup used to decode the unmodified HTML. It is based on the encoding that the HTML itself declared, if one is available, or otherwise on an educated guess from statistical analysis of the bytes of the original data.
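One caveat worth guarding against: as far as I know, soup.original_encoding is None when BeautifulSoup was handed text that was already Unicode, since nothing had to be detected. The sketch below adds a fallback for that case. The helper name encode_markup and the UTF-8 fallback are assumptions, not part of BeautifulSoup.

```python
def encode_markup(markup, detected_encoding, fallback='utf-8'):
    """Encode Unicode markup using the detected encoding when there is one.

    Hypothetical helper: detected_encoding stands in for
    soup.original_encoding, and the UTF-8 fallback is an assumption.
    """
    codec = detected_encoding or fallback
    return markup.encode(codec)


# An encoding was detected, so it is reused for the output.
encoded_latin1 = encode_markup(u'\xe1', 'iso-8859-1')

# Nothing was detected, so the fallback codec is used instead.
encoded_fallback = encode_markup(u'\xe1', None)
```

With a helper like this, the write would look like modi_f.write(encode_markup(soup.prettify(), soup.original_encoding)). On Python 3 the output file would also need to be opened in binary mode ('wb'), because the result is a byte string.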

solution1
1 ACCEPTED 2014-12-05 11:24:14
