
Pandas read_csv file importing error

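I am trying to import a csv file in Pandas, but it throws an error. This is what the data looks like when opened in Notepad++; the first row holds the column names: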

"End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date"
"11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,""Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw"",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,""Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho"",125.85,1,FY07,12/28/2006"
"12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008"
"12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008"
"11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008"
"13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008"
"7872800,Science Application International Corporation,2839760,GOVERNMENT OF KARNATAKA,10237455,Cubic Computing P.L,Server & CAL,SQL Server & CAL,SQL Server Standard,SQL Server Standard Edition,Non-specific,Open,Open SA/UA,Deferred Open SA - Renewal,0,0,FY09,1/15/2009"
"13096361,Pratham Software Pvt Ltd,13096361,Pratham Software Pvt Ltd,10133086,Krap Computer,Information Worker,Office,Office Standard / Basic,Office Standard,2007,Open,Open L,Open Std,7132.44,28,FY09,9/24/2008"
"12192276,Texmo Precision Castings,12192276,Texmo Precision Castings,4059430,Quadra Systems. - Partner,Server & CAL,Windows Server & CAL,Windows Standard Server,Windows Server Standard,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,11/15/2008"

Note that when I double-click the same file, it opens in Excel as comma-separated values, but without the quotes around each line that Notepad++ shows.

I used UTF-8 as the encoding, which produces the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 13: invalid start byte
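The error message can be reproduced in isolation. The sketch below uses made-up bytes (not taken from the file) and assumes the file contains Windows-style curly quotes: byte 0x91 is the left single curly quote in cp1252, but in UTF-8 it can only appear as a continuation byte, so it is rejected as a start byte.

```python
# Made-up bytes containing cp1252 curly quotes (0x91 and 0x92); not from the real file.
raw = b"\x91Open\x92 License"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    reason = exc.reason  # the text after the colon in the error message

as_cp1252 = raw.decode("cp1252")  # 0x91/0x92 become the curly quotes U+2018/U+2019
as_latin1 = raw.decode("latin1")  # never fails, but maps 0x91/0x92 to control characters

print(utf8_ok, reason)
print(repr(as_cp1252))
print(repr(as_latin1))
```

Both cp1252 and latin1 decode without raising, which matches what is described here, but only cp1252 yields the intended punctuation for such bytes.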

I then tried encoding='cp1252' first, and latin1 after that.

df=pd.read_csv(filename,encoding='cp1252') 

or 

df=pd.read_csv(filename,encoding='latin1')

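With either of these encodings there is no error and the data is imported, but everything ends up in one single column instead of separate columns.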

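Does this have something to do with the " marks around each line of the data? I have a similar csv file with comma-separated values but without the double quotes around each line, and it imports correctly with cp1252 and latin1. That file does not work with UTF-8 either, even though it was saved as UTF-8 in Notepad++. For the file in question, UTF-8 fails outright and the other two encodings import it as a single column.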

Please advise.

Thanks

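I'm pretty sure the quotes are the cause: because each line is wrapped in double quotes, the parser treats the whole line as one quoted field and all the commas inside it as literal text. So you need to strip the wrapping quotes out. That is relatively simple to do, but since unicode issues drive me crazy, I'd suggest that you read the file in, strip the quotes, and write the result to a new file for use with read_csv (as this will ease the encoding issues).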
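The effect of the wrapping quotes can be seen with Python's standard csv module, which follows the same default quoting convention as read_csv (double quote as the quote character, doubled quotes as an escaped quote). This is only an illustrative sketch; the row is made up in the same shape as the data in the question.

```python
import csv

# A made-up line in the same shape as the file: the whole line is wrapped in quotes,
# and quotes that belong to a field are doubled.
wrapped = '"1,""Acme, Ltd"",125.85"'

# Remove the outer pair of quotes and turn the doubled quotes back into single ones.
unwrapped = wrapped[1:-1].replace('""', '"')

as_wrapped = next(csv.reader([wrapped]))
as_unwrapped = next(csv.reader([unwrapped]))

print(as_wrapped)    # ['1,"Acme, Ltd",125.85']  -> one field
print(as_unwrapped)  # ['1', 'Acme, Ltd', '125.85']  -> three fields
```

Note that the doubled quotes inside the line have to be un-doubled as well; otherwise fields that contain a comma, such as the reseller name in the first data row, would be split in the wrong place.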

Here is how to read the file, strip the quotes, write the result to a new file, and then read that file with read_csv:

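import pandas as pd

# Text mode with one encoding on both sides (writing str to a 'wb' file fails on Python 3).
with open(filename, encoding='cp1252') as infile, open(tmpfile, 'w', encoding='cp1252') as outfile:
    for line in infile:
        row = line.rstrip('\r\n')  # drop the newline so the closing quote is reachable
        if row.startswith('"') and row.endswith('"'):
            row = row[1:-1].replace('""', '"')  # remove the outer pair, un-double inner quotes
        outfile.write(row + '\n')

result = pd.read_csv(tmpfile, encoding='cp1252')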

You will also want to delete the temporary file once you have read it.
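One possible way to make sure the temporary file is always removed, even if reading it fails, is a try/finally around the read. The helper names `with_temp_copy` and `read_back` below are hypothetical and only serve this sketch.

```python
import os
import tempfile


def with_temp_copy(text, use):
    """Write text to a temporary .csv file, call use(path), then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w", encoding="cp1252", newline="") as handle:
            handle.write(text)
        return use(path)
    finally:
        # Runs whether use() succeeded or raised, so the file never lingers.
        os.remove(path)


def read_back(path):
    """Stand-in for the real reader; returns the path and the file content."""
    with open(path, encoding="cp1252") as handle:
        return path, handle.read()


path, content = with_temp_copy("a,b\n1,2\n", read_back)
print(content)
print(os.path.exists(path))  # False: the file is gone after the call
```

In the real case the second argument would be something like `lambda p: pd.read_csv(p, encoding='cp1252')`, with the cleaned text passed as the first argument.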

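The reason I suggest this is that it avoids dealing with encoding/decoding when passing data to a StringIO buffer, which can be finicky for both Python and pandas.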
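For comparison, on Python 3, where files opened in text mode already yield decoded str, an in-memory buffer is workable too. The sketch below is an alternative to the temporary file, not the approach recommended above. It parses each line twice with the standard csv module: once to remove the outer quotes, once to split the real fields. The function name `unwrap_to_buffer` is hypothetical, and the sketch assumes no field contains an embedded newline.

```python
import csv
import io


def unwrap_to_buffer(lines):
    """Return a StringIO with the outer quotes of every line removed.

    lines is any iterable of text lines, for example a file opened in text mode.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in csv.reader(lines):
        if row:
            # row[0] holds the whole original line; parse it again to get the fields.
            writer.writerow(next(csv.reader([row[0]])))
    buffer.seek(0)
    return buffer


# Made-up sample in the same shape as the file in the question.
sample = ['"id,name,amount"\n', '"1,""Acme, Ltd"",125.85"\n']
buffer = unwrap_to_buffer(sample)
print(buffer.getvalue())
```

With the real file, the buffer could be built from `open(filename, encoding='cp1252', newline='')` and then passed straight to `pd.read_csv(buffer)`, since read_csv accepts file-like objects.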

