UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
I am trying to read a dataset into df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a very long traceback; this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for a single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
Decode it as Windows code page 1252 instead, where 0x92 is a fancy quote, ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
sep=";", encoding='cp1252')
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 \
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5
3 American Samoa .. .. .. .. .. .. .. .. .. ..
4 Andorra .. .. .. .. .. .. .. .. .. ..
2010 2011 2012 2013 Unnamed: 15 2014 2015
0 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 .. .. .. .. NaN .. ..
4 .. .. .. .. NaN .. ..
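The diagnosis above can be checked without pandas. The following is a minimal sketch that uses only the byte string quoted earlier; the variable names are illustrative and not part of the original answer.

```python
# The offending field, exactly as quoted in the answer above.
raw = b"Korea, Dem. People\x92s Rep."

# Strict UTF-8 decoding fails: 0x92 is a continuation byte, so it
# cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # kept so the offset can be inspected afterwards
    print(exc)

# Windows code page 1252 maps 0x92 to U+2019, the right single
# quotation mark, so the same bytes decode cleanly.
fixed = raw.decode("cp1252")
print(fixed)
```

The offset reported by the exception (`error.start`) is the index of the 0x92 byte within whatever chunk was being decoded, which is why the position in the message can differ between runs on the same file.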
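However, I noticed that pandas also seems to take the HTTP headers at face value, and produces mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data: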
>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
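The round trip used here (encode as cp1252, decode as UTF-8) can be reproduced in isolation. This is a small sketch, assuming the damage comes from UTF-8 bytes that were read as cp1252; `repair_mojibake` is a hypothetical helper name, not a pandas or standard-library function.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 -> cp1252 mis-decoding (hypothetical helper)."""
    return text.encode("cp1252").decode("utf-8")


correct = "Korea, Dem. People\u2019s Rep."

# Simulate the damage: take the UTF-8 bytes and read them as cp1252.
garbled = correct.encode("utf-8").decode("cp1252")
print(garbled)

# Reverse the two steps to recover the original text.
print(repair_mojibake(garbled))
```

The repair only works when every garbled character can be encoded back to cp1252, so it is a diagnostic aid rather than a general fix.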
This is a known bug in pandas. You can work around it by loading the URL with urllib.request and passing the response to pd.read_csv():
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
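The idea behind this workaround is to keep the download and the decoding separate, so that only the codec you name is applied to the bytes. Below is a standard-library sketch of that separation; `rows_from_bytes` and the two-line sample are illustrative assumptions standing in for the bytes that `resp.read()` would return.

```python
import csv
import io


def rows_from_bytes(raw: bytes, encoding: str = "cp1252", sep: str = ";"):
    """Decode raw CSV bytes with an explicit codec and return the rows."""
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=sep))


# Illustrative stand-in for downloaded bytes; not the real file contents.
sample = b"Country;2000\r\nKorea, Dem. People\x92s Rep.;..\r\n"

rows = rows_from_bytes(sample)
print(rows[1][0])
```

The decoded text could equally be wrapped in io.StringIO and handed to pd.read_csv(), which accepts file-like objects.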
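It turned out that a CSV created on macOS was being parsed on a Windows machine, and I got the UnicodeDecodeError. To get rid of this error, try passing encoding='mac-roman' to the read_csv method of the pandas library.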
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()
Output:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Unnamed: 15 2014 2015
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 American Samoa .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
4 Andorra .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
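Several single-byte codecs will read this file without raising an error, but they do not agree on what 0x92 means, so it is worth checking the affected value rather than only the first rows. A short sketch, assuming the same byte as in the question; the list of candidate codecs is just an illustration.

```python
raw = b"People\x92s"

# Candidate single-byte codecs; none of them raises on 0x92.
candidates = ["cp1252", "mac_roman", "iso-8859-1"]

decoded = {name: raw.decode(name) for name in candidates}
for name, text in decoded.items():
    print(f"{name:>10}: {text!r}")
```

Only cp1252 yields the apostrophe; mac_roman turns the byte into an accented "í", and ISO-8859-1 turns it into an invisible control character.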
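This problem occurs because the file contains characters that are not valid in the encoding being used. For example, a file that you read as UTF-8 contains some Windows-1250 characters. You should remove or replace these characters to solve the problem.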
text = open(fn, 'rb').read().decode('ISO-8859-1')
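If the exact codec is unknown, the "remove or replace" suggestion can be approximated with the error handlers of bytes.decode(). The sketch below reuses the byte string from the question; it is an illustration of the standard-library behaviour, not the answerer's own code.

```python
raw = b"Korea, Dem. People\x92s Rep."

# Replace each undecodable byte with U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# Or drop undecodable bytes entirely.
ignored = raw.decode("utf-8", errors="ignore")

print(replaced)
print(ignored)
```

Both handlers lose information, so naming the correct codec is preferable when it can be identified. Recent pandas versions expose a similar option through the encoding_errors parameter of read_csv.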
This works:
df = pd.read_csv(inputfile, engine = 'python')