
'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I am trying to read a dataset into a DataFrame called df1, but it does not work:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code above produces a long traceback; this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

b'Korea, Dem. People\x92s Rep.'

Decode it as Windows codepage 1252 instead, where 0x92 is a curly closing quote, ’ (U+2019):

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')
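To see the failure in isolation, the sketch below decodes the sample bytes quoted above with both codecs. The variable names are illustrative only, and the offset it reports refers to this short sample, not necessarily to the offset in the full file.

```python
# Illustrative sample: the bytes quoted above, with the lone 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x92 is a continuation byte in UTF-8, so it cannot start a character.
    failed_at = exc.start
    reason = exc.reason

# cp1252 maps 0x92 to U+2019 (RIGHT SINGLE QUOTATION MARK).
decoded = raw.decode("cp1252")

print(failed_at, reason)
print(decoded)
```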

Demo:

>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
...                   sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..

I note, however, that pandas seems to take the HTTP headers at face value too, and produces mojibake when you load your data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is correctly decoded, but loading from the URL produces re-coded data:

>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
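The repair expression above works because this kind of mojibake is reversible. The sketch below models the problem as UTF-8 bytes being decoded as cp1252; that is an assumption about what happens inside pandas, used here only to reproduce the symptom.

```python
original = "Korea, Dem. People\u2019s Rep."  # U+2019 is the curly apostrophe

# Wrong direction: UTF-8 bytes read as cp1252 give three junk characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Repair: undo the wrong decode, then decode the bytes as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(repr(mojibake))
print(repr(repaired))
```

The three UTF-8 bytes of U+2019 (0xE2 0x80 0x99) each map to a separate cp1252 character, which is why one apostrophe turns into three characters.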

This is a known bug in pandas. You can work around it by using urllib.request to load the URL and passing the response to pd.read_csv() instead:

>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
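The idea behind the workaround is to hand the parser a binary stream and choose the encoding yourself. As a rough, standard-library-only illustration (not the pandas code path), the sketch below does the same with io.TextIOWrapper and csv.reader. The helper name read_semicolon_csv and the in-memory stand-in for the HTTP response are hypothetical, and the values in the Korea row are placeholders rather than real data.

```python
import csv
import io


def read_semicolon_csv(binary_stream, encoding="cp1252"):
    """Decode a binary stream explicitly, then parse ';'-separated rows."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.reader(text_stream, delimiter=";"))


# Stand-in for the response body; the real file has many more columns.
# The ".." values in the Korea row are placeholders, not real data.
fake_response = io.BytesIO(
    b" ;2000;2001\r\n"
    b"Afghanistan;55.1;55.5\r\n"
    b"Korea, Dem. People\x92s Rep.;..;..\r\n"
)

rows = read_semicolon_csv(fake_response)
print(rows[2][0])
```

The object returned by urllib.request.urlopen() is also a binary stream, so it should be usable in place of the BytesIO stand-in.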

It turned out that a CSV created on macOS was being parsed on a Windows machine, which is why I got the UnicodeDecodeError. To get rid of this error, try passing encoding='mac_roman' to pandas' read_csv:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()

Output:

                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
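One caveat: a codec that decodes without raising is not necessarily the right one, and the first five rows contain only ASCII, so they look the same under any ASCII-compatible encoding. The check below is a small sketch using a fragment of the problem bytes from the question's data, comparing what 0x92 becomes under each codec.

```python
raw = b"People\x92s"

as_cp1252 = raw.decode("cp1252")        # 0x92 -> U+2019, curly apostrophe
as_mac_roman = raw.decode("mac_roman")  # 0x92 -> U+00ED, i with acute accent

print(as_cp1252)
print(as_mac_roman)
```

Inspecting a row that actually contains the non-ASCII byte shows whether the chosen encoding produced the intended text.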

This problem occurs because of some characters in your file that are not valid UTF-8. For example, your file is being read as UTF-8 but contains some characters encoded in Windows-1250. You should remove or replace these characters to solve your problem.
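If removing or replacing the offending bytes is acceptable, Python's decoder can do it through its errors argument. This is a minimal sketch on the sample bytes from the question; both options lose the original character, so they suit cases where the exact text does not matter.

```python
raw = b"Korea, Dem. People\x92s Rep."

replaced = raw.decode("utf-8", errors="replace")  # bad byte -> U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte dropped

print(repr(replaced))
print(repr(ignored))
```

Recent pandas versions (1.3 and later, as far as I know) expose the same choice through the encoding_errors parameter of read_csv.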

Use 'ISO-8859-1' instead of 'utf-8' for decoding:

with open(fn, 'rb') as f:  # fn is the path to your file
    text = f.read().decode('ISO-8859-1')
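ISO-8859-1 never raises because it maps every byte value to the code point with the same number. The sketch below illustrates that, and also shows the trade-off: 0x92 becomes an invisible control character rather than an apostrophe.

```python
all_bytes = bytes(range(256))

# ISO-8859-1 maps byte N to code point N, so decoding cannot raise.
latin1_text = all_bytes.decode("ISO-8859-1")

# 0x92 becomes U+0092, a C1 control character, not an apostrophe.
latin1_char = b"\x92".decode("ISO-8859-1")

# Python's cp1252 leaves a few bytes undefined, so it can still raise.
try:
    b"\x81".decode("cp1252")
    cp1252_raises = False
except UnicodeDecodeError:
    cp1252_raises = True

print(len(latin1_text), repr(latin1_char), cp1252_raises)
```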

See: https://grabthiscode.com/whatever/utf-8-codec-cant-decode-byte-0x85-in-position-715-invalid-start-byte

This works:

df = pd.read_csv(inputfile, engine='python')
