
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)

I open the CSV using:

ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')

Then, I attempt to encode it with:

name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13],county=row[25].encode('utf-8'), lat=row[22], lng=row[23])

I'm encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.

Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)

I think I should tell you that I'm using Python 2.7.2, and this is part of an app built on Django 1.4. I've read several posts on this topic, but none of them seem to directly apply. Any help will be greatly appreciated.

You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.

Unicode is not equal to UTF-8. The latter is just an encoding for the former.

You are doing it the wrong way around. You are reading UTF-8-encoded data, so you have to decode the UTF-8-encoded string into a unicode string.

So just replace .encode with .decode, and it should work (if your .csv is UTF-8-encoded).

Nothing to be ashamed of, though. I bet 3 in 5 programmers had trouble at first understanding this, if not more ;)

Update: If your input data is not UTF-8 encoded, then you have to .decode() with the appropriate encoding, of course. If nothing is given, Python assumes ASCII, which obviously fails on non-ASCII characters.
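A minimal sketch of that advice, written for Python 3. Note that the byte 0xd1 from the traceback is Ñ in Latin-1, whereas UTF-8 encodes Ñ as the two bytes 0xc3 0x91, so the file in the question may well be Latin-1 rather than UTF-8. The helper name decode_field and the sample bytes are made up for illustration, and the order of candidate encodings is an assumption you should adapt to your data.

```python
def decode_field(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("could not decode %r with %r" % (raw, encodings))


latin1_bytes = b"PE\xd1A"      # hypothetical field, 0xd1 at position 2
utf8_bytes = b"PE\xc3\x91A"    # the same text encoded as UTF-8

print(decode_field(latin1_bytes) == decode_field(utf8_bytes))
```

Because Latin-1 maps every byte to a character, it never fails, so it only makes sense as the last candidate.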

Just add these lines to your code:

1. Python 2

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

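2. Python 3

sys.setdefaultencoding() was removed in Python 3, and reload(sys) does not bring it back, so the lines above do not carry over. The default string encoding is already UTF-8, which you can confirm with:

import sys
print(sys.getdefaultencoding())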

For Python 3 users, you can do:

with open(csv_name_here, 'r', encoding="utf-8") as f:
    #some codes

It works with Flask too :)
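Applied to a tab-delimited file like the one in the question, that could look roughly like this. The file name, column names and sample row are invented so the sketch runs on its own; only the delimiter and quotechar come from the question.

```python
import csv
import os
import tempfile

# Write a small tab-delimited sample so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("name\tcity\nPE\u00d1A ELEMENTARY\tSAN JOS\u00c9\n")

# newline="" is what the csv module documentation recommends for open().
with open(path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t", quotechar='"'))

print(rows[1])
```

Every field comes back as a str that is already decoded, so no .encode() or .decode() calls are needed on the rows.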

The main reason for the error is that the default encoding assumed by Python 2 is ASCII. Calling encode('utf8') on a byte string makes Python first decode it implicitly with that default. Hence, if the data contains bytes outside the ASCII range, e.g. a string like 'hgvcj터파크387', Python raises an error because the string is not in the expected encoding.
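To see the mechanism in isolation, this sketch does explicitly, on Python 3, the ASCII decode that Python 2 performs behind the scenes before encoding a byte string. The sample bytes are made up; the point is only that the resulting message has the same shape as the one in the question.

```python
raw = b"PE\xd1A"  # made-up byte string with the non-ASCII byte 0xd1 at index 2

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```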

If you are using Python 2 (the only line where reload() is a builtin and sys.setdefaultencoding() exists), one fix is to set the default encoding assumed by Python to utf8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
name = school_name.encode('utf8')

This way Python will be able to handle characters within a string that fall outside the ASCII range.

However, on Python 3 the builtin reload() function is not available and sys.setdefaultencoding() no longer exists, so you would have to fix it by decoding the byte string explicitly, e.g.

name = school_name.decode('utf8').encode('utf8')

For Python 3 users:

Changing the encoding from 'ascii' to 'latin1' works.
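One caveat, shown as a small sketch: Latin-1 maps every possible byte to a character, so decoding with it never raises, but if the data is really UTF-8 the result is silently wrong. The variable names are only for illustration.

```python
utf8_bytes = "\u00d1".encode("utf-8")      # the letter Ñ as two UTF-8 bytes
as_latin1 = utf8_bytes.decode("latin-1")   # no error, but two wrong characters
as_utf8 = utf8_bytes.decode("utf-8")       # the original single character

print(len(as_latin1), len(as_utf8))
```

So 'latin1' is a reasonable choice when the file really is Latin-1, and a risky way to merely silence the exception otherwise.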

Also, you can try finding the encoding automatically by reading the top 10000 bytes using the below snippet:

import chardet

# Read the first 10000 bytes in binary mode and let chardet guess the encoding.
with open("dataset_path", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

My computer had the wrong locale set.

I first did:

>>> import locale
>>> locale.getpreferredencoding(False)
'ANSI_X3.4-1968'

locale.getpreferredencoding(False) is the function called by open() when you don't provide an encoding. The output should be 'UTF-8', but in this case it's some variant of ASCII.
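If you would rather not depend on the locale at all, a small sketch like the following reports what open() would fall back to and then passes the encoding explicitly. The temporary file and its contents are made up for the example.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding argument is given.
preferred = locale.getpreferredencoding(False)
print("open() would default to:", preferred)

# Passing encoding= explicitly makes the result independent of the locale.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("PE\u00d1A")
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
```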

Then I ran the bash command locale and got this output:

$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

So, I was using the default Ubuntu locale, which causes Python to open files as ASCII instead of UTF-8. I had to set my locale to en_US.UTF-8:

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales

If you can't change the locale system wide, you can invoke all your Python code like this:

PYTHONIOENCODING="UTF-8" python3 ./path/to/your/script.py

or do

export PYTHONIOENCODING="UTF-8"

to set it in the shell you run that in.
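A quick way to check what the variable does, as a sketch: start a child interpreter with PYTHONIOENCODING set and ask it for the encoding of its standard output. As far as the Python documentation describes it, this variable overrides the encoding of stdin, stdout and stderr; it may not change what open() uses for files, so passing encoding= there can still be necessary.

```python
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")
child_code = "import sys; print(sys.stdout.encoding)"
output = subprocess.check_output([sys.executable, "-c", child_code], env=env)

stdout_encoding = output.decode("ascii").strip()
print(stdout_encoding)
```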

If you get this issue while running certbot when creating or renewing a certificate, please use the following method:

grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

That command found the offending character "´" in a comment in one .conf file. After removing it (you can edit comments as you wish) and reloading nginx, everything worked again.

Source: https://github.com/certbot/certbot/issues/5236
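If grep -P is not available, roughly the same search can be sketched in Python. The function name find_non_ascii and the sample config file are invented for this example; point it at your own .conf files instead.

```python
import os
import tempfile


def find_non_ascii(path):
    """Return (line_number, column, byte) for every byte above 0x7f."""
    hits = []
    with open(path, "rb") as f:
        for line_number, line in enumerate(f, start=1):
            for column, byte in enumerate(line):
                if byte > 0x7F:
                    hits.append((line_number, column, byte))
    return hits


# Made-up config file with one stray Latin-1 acute accent in a comment.
sample = os.path.join(tempfile.mkdtemp(), "example.conf")
with open(sample, "wb") as f:
    f.write(b"server_name example.org;\n# it\xb4s a comment\n")

hits = find_non_ascii(sample)
print(hits)
```

Line numbers start at 1 and columns at 0, counted in bytes rather than characters.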

Because of the latitude and longitude, open the file with UTF-16 encoding.

with open(csv_name_here, 'r', encoding="utf-16") as f:

Or, when you deal with text in Python and it is Unicode text, mark it as Unicode.

Set text=u'unicode text' instead of just text='unicode text'.

This worked in my case.

It only worked when I opened the file with the argument 'rb' (read binary) instead of 'r' (read).

I was dealing with this issue inside of a Docker container. It might be the case (as it was for me) that you only need to generate the locale and do nothing more:

sudo locale-gen en_US en_US.UTF-8

In some cases that was sufficient for me because locales was already installed and configured. If you have to install locales and configure it, add the following part to your Dockerfile:

RUN apt update && apt install -y locales && \
    sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
    echo 'LANG="en_US.UTF-8"'>/etc/default/locale && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    update-locale LANG=en_US.UTF-8

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV LC_ALL en_US.UTF-8

I tested it like this:

cat <<EOF > /tmp/test.txt
++*=|@#|¼üöäàéàè!´]]¬|¢|¢¬|{ł|¼½{}}
EOF

python3
import pathlib; pathlib.Path("/tmp/test.txt").read_text()

I faced this issue while loading data with pickle. Try:

data = pickle.load(f,encoding='latin1')
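This typically comes up when Python 3 loads a pickle written by Python 2 that contains 8-bit strings, which pickle decodes as ASCII by default. A self-contained sketch, using a tiny hand-written protocol 2 pickle as a stand-in for a real file (the payload is made up):

```python
import pickle

# Hand-written protocol 2 pickle of the Python 2 byte string 'PE\xd1A':
# PROTO 2, SHORT_BINSTRING of length 4, STOP.
py2_pickle = b"\x80\x02U\x04PE\xd1A."

try:
    pickle.loads(py2_pickle)  # default encoding for such strings is ASCII
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# Latin-1 accepts every byte, so the load succeeds.
text = pickle.loads(py2_pickle, encoding="latin1")
print(default_failed, text == "PE\u00d1A")
```

If the pickled strings are really UTF-8 or raw binary data, encoding='utf-8' or encoding='bytes' may be the better choice.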


