'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
I am trying to read in a dataset called df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback, but this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')
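To see why the UTF-8 codec rejects the data and why cp1252 accepts it, the single sample string from above can be decoded directly. This is a minimal sketch using only the standard library; the variable names are illustrative, and the offset it reports is relative to this sample string, not to the CSV file.

```python
# Sample bytes quoted in the answer above: ASCII except for one 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."

try:
    raw.decode("utf-8")
    bad_offset = None  # not reached for this sample
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not decode.
    bad_offset = exc.start
    bad_byte = raw[exc.start]
    print(f"byte 0x{bad_byte:02x} at offset {bad_offset}: {exc.reason}")

# In cp1252, 0x92 maps to U+2019 (right single quotation mark).
text = raw.decode("cp1252")
print(text)
```

The same check can be run on any suspicious line read from a file opened in binary mode.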
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..
I note, however, that Pandas seems to take the HTTP headers at face value too and produces Mojibake when you load your data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is correctly decoded, but loading from the URL produces re-coded data:
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
This is a known bug in Pandas. You can work around this by using urllib.request to load the URL and pass that to pd.read_csv() instead:
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
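If a frame has already been loaded with the re-coded text, the encode/decode round trip shown earlier can be wrapped in a small helper that leaves clean strings alone. This is a sketch under my own assumptions: fix_mojibake is a hypothetical name, not a pandas function, and the round trip is only a heuristic for text that was UTF-8 but got decoded as cp1252.

```python
def fix_mojibake(value):
    """Undo UTF-8 text that was wrongly decoded as cp1252.

    Returns the input unchanged when the round trip is not possible,
    which is the case for strings that were decoded correctly.
    """
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


# "\u00e2\u20ac\u2122" is the three-character sequence a curly apostrophe
# turns into when its UTF-8 bytes (E2 80 99) are read as cp1252.
garbled = "Korea, Dem. People\u00e2\u20ac\u2122s Rep."
repaired = fix_mojibake(garbled)
print(repaired)

# A correctly decoded string passes through untouched.
already_ok = fix_mojibake("Korea, Dem. People\u2019s Rep.")
```

Applied to the frame above, this would presumably look like df1[' '].map(fix_mojibake) for a column without missing values, although reading the data with the urllib.request workaround avoids the problem in the first place.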
It turned out that the CSV had been created on macOS and was being parsed on a Windows machine, which is why I got the UnicodeDecodeError. To get rid of this error, try passing the argument encoding='mac_roman' to the read_csv method of the pandas library.
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()
Output:
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  Unnamed: 15  2014  2015
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
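Since the first rows of this file are plain ASCII, head() looks the same under any single-byte codec that does not raise. One way to compare candidates is to decode the one non-ASCII sample quoted in this thread under each of them. A small standard-library sketch; the list of candidate codecs is my own choice for illustration.

```python
# The only non-ASCII value quoted in this thread, as raw bytes.
raw = b"Korea, Dem. People\x92s Rep."

# Candidate single-byte codecs to compare (an illustrative selection).
candidates = ["cp1252", "mac_roman", "latin-1"]

# Decode the same bytes under each codec; none of these raises for 0x92.
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    # repr() makes invisible control characters show up as escapes.
    print(f"{name:10} {text!r}")
```

On this sample, cp1252 yields the curly apostrophe, mac_roman maps 0x92 to í, and latin-1 maps it to an invisible control character, so it is worth inspecting a row that actually contains the non-ASCII byte before settling on an encoding.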
This problem occurs because of some unexpected characters in your file. For example, your file is being read as UTF-8 but contains some characters encoded in Windows-1250. You should remove or replace these characters to solve your problem:
text = open(fn, 'rb').read().decode('ISO-8859-1')
Refer to the link: https://grabthiscode.com/whatever/utf-8-codec-cant-decode-byte-0x85-in-position-715-invalid-start-byte
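The "remove or replace" suggestion above can also be applied at decode time through the errors argument of bytes.decode. A minimal sketch on the sample bytes from this thread; it only shows what each handler does to the offending byte and is not a recommendation over choosing the correct encoding.

```python
raw = b"Korea, Dem. People\x92s Rep."

# errors="replace" substitutes U+FFFD (the replacement character)
# for every byte that is not valid UTF-8.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the undecodable bytes.
dropped = raw.decode("utf-8", errors="ignore")

print(replaced)
print(dropped)
```

Both handlers lose the original character, whereas decoding with the right codec keeps it.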
This works:
df = pd.read_csv(inputfile, engine = 'python')