Python Unicode error, 'ascii' codec can't encode character
I am getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 587: ordinal not in range(128)
My code:
import os
from bs4 import BeautifulSoup

# Raw strings keep the backslashes in the Windows paths literal.
do = dir_with_original_files = r'C:\Users\Me\Directory'
dm = dir_with_modified_files = r'C:\Users\Me\Directory\New'

for root, dirs, files in os.walk(do):
    for f in files:
        if f.endswith('~'):  # you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        mf = f.split('.')
        mf = ''.join(mf[:-1])+'_mod.'+mf[-1]  # you can keep the same name
                                              # if you omit the last two lines.
                                              # They are in separate directories
                                              # anyway. In that case, mf = f
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            soup = BeautifulSoup(orig_f.read())
            for t in soup.find_all('td', class_='test'):
                t.string.wrap(soup.new_tag('h2'))
            # This is where you create your new modified file.
            modi_f.write(soup.prettify())
This code iterates over a directory and, for each file, finds all of the td elements of class test and wraps the text within each td in h2 tags. So previously, it would have been:
<td class="test"> text </td>
After running this program, a new file will be created with:
<td class="test"> <h2>text</h2> </td>
Or at least, this is how I would like it to function. Unfortunately, I am currently getting the error described above. I believe this is because the text I am parsing is written in Spanish and includes accented and other special Spanish characters.
What can I do to fix my issue?
soup.prettify() returns a Unicode string, but your file expects a byte string. Python tries to help here and automatically encodes the result, but your Unicode string contains codepoints that are beyond the ASCII standard and thus the encoding fails.
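To see the failure in isolation, here is a minimal sketch that encodes a short Unicode string to ASCII by hand, which is roughly what the implicit conversion does. The sample text is made up for illustration; only the character u'\xe1' comes from the error message above.

```python
# Sample text (an assumption for illustration) containing U+00E1,
# the character named in the error message.
text = u'caf\xe1'

try:
    text.encode('ascii')          # what the implicit conversion attempts
    failed = False
    message = ''
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)            # ends with "ordinal not in range(128)"

# A codec that covers the character succeeds.
utf8_bytes = text.encode('utf-8')
```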
You'll have to either manually encode to a different codec, or use a different file object type that'll do this automatically for you.
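Both options can be sketched as follows. This is an illustration, not the asker's actual setup: the temporary path, the sample markup, and the choice of UTF-8 are all assumptions. io.open is one file object type that encodes on write, and it behaves the same way on Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample output containing a non-ASCII character (U+00F3).
text = u'<td class="test"><h2>canci\xf3n</h2></td>\n'

# Hypothetical output path in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'example_mod.html')

# Option 1: a file object that encodes for you on every write.
with io.open(path, 'w', encoding='utf-8') as modi_f:
    modi_f.write(text)

# Option 2: encode manually and write the bytes in binary mode.
with open(path, 'wb') as modi_f:
    modi_f.write(text.encode('utf-8'))

# Read the file back to confirm the text survived the round trip.
with io.open(path, 'r', encoding='utf-8') as check_f:
    round_tripped = check_f.read()
```

Either option alone is enough; the second simply overwrites what the first wrote.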
In this case, I'd encode to the original encoding that BeautifulSoup detected for you:
modi_f.write(soup.prettify().encode(soup.original_encoding))
soup.original_encoding reflects the encoding that BeautifulSoup used to decode the unmodified HTML. It is based on the encoding that the HTML itself declared, if one is available, or otherwise on an educated guess from statistical analysis of the bytes of the original data.
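One caveat worth guarding against: the detected encoding may be missing (for example, if the markup was already a Unicode string when it was parsed), or it may not be able to represent every character in the output. The helper below is a hypothetical sketch, not part of BeautifulSoup; the name encode_for_output and the UTF-8 fallback are assumptions.

```python
def encode_for_output(text, detected_encoding, fallback='utf-8'):
    """Encode text with the detected encoding, or with the fallback.

    Hypothetical helper: detected_encoding stands in for a value such as
    soup.original_encoding, which may be None.
    """
    encoding = detected_encoding or fallback
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The detected codec cannot represent every character.
        return text.encode(fallback)
```

It would be used as modi_f.write(encode_for_output(soup.prettify(), soup.original_encoding)), with the output file opened in binary mode. Note that falling back to a different codec can leave the file out of step with any encoding the HTML declares for itself.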