
Python Pandas read_excel: different behavior in parsing a MultiIndex DataFrame between Pandas 0.18.1 and Pandas > 0.19

I am totally confused. I have probably missed an update to the pandas API.

So I have this Excel file:

[image: screenshot of the Excel file]

In Pandas 0.18.1 I had no trouble reading and parsing the file. I used the following code:

import pandas as pd
fname = 'SAMPLE_EXCEL_CAUSING_ERROR_IN_PANDAS_0_19_UP.xlsx'
pd.read_excel(fname, 'Sheet1', header=[0,1], index=[0,1])

It returned what I wanted.

[1]:https://i.stack.imgur.com/Muliy.png

Recently I updated my packages, and my pandas is now at version 0.20.1. However, when I ran the same code on the same Excel file, it raised an error. Here is the error message: ValueError: Length of new names must be 1, got 2


Any clue as to what I missed in the new read_excel API? I am totally confused. Is there a workaround for reading an Excel file with MultiIndex columns? My real data has a 3-level index rather than a 2-level one. Thanks a lot for any suggestions.

PS: I cannot downgrade to 0.18.1 because my users are on 0.20.1.

UPDATE

Strangely, if I set header=[1,2], it does not throw any error. However, the wrong level ends up as my index. I am still trying to find a workaround for this issue.


You could build the column index semi-manually.

Get the header

Take only the first 2 rows, starting from the second column, and forward-fill the empty cells from the left:

header = pd.read_excel(fname, 'Sheet1', header=None).iloc[:2, 1:].ffill(axis=1)

Get the data

Skip the first 2 rows and set the first column as the index:

df = pd.read_excel(fname, 'Sheet1', skiprows=[0,1], header=None).rename(columns={0: 'A'}).set_index('A')

MultiIndex

df.columns=pd.MultiIndex.from_arrays(header.values)

df

      B         D         F
      C    C    E    E    G    G
A
A1    X    Y    Z    U    J    K
A2   XX   YY   ZZ   UU   JJ   KK
A3  XXX  YYY  ZZZ  UUU  JJJ  KKK
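
The steps above can be wrapped into one helper that also covers the 3-level case mentioned in the question. This is only a sketch: the function name apply_header_rows and its parameters are hypothetical, and the raw frame below is a hand-built stand-in for what pd.read_excel(fname, 'Sheet1', header=None) is assumed to return for the sample sheet (merged header cells read back as empty).

```python
import pandas as pd


def apply_header_rows(raw, n_header_rows=2, index_name="A"):
    """Turn the top rows of a header-less sheet into MultiIndex columns.

    Hypothetical helper: `raw` is a frame read with header=None, whose first
    column holds the row labels.
    """
    # Header cells: top rows without the first column; fill blanks from the left.
    header = raw.iloc[:n_header_rows, 1:].ffill(axis=1)
    # Data rows: everything below the header, indexed by the first column.
    data = (
        raw.iloc[n_header_rows:]
        .rename(columns={0: index_name})
        .set_index(index_name)
    )
    # One array per header row becomes one level of the column index.
    data.columns = pd.MultiIndex.from_arrays(header.values)
    return data


# Assumed stand-in for pd.read_excel(fname, 'Sheet1', header=None).
raw = pd.DataFrame([
    [None, "B", None, "D", None, "F", None],
    [None, "C", "C", "E", "E", "G", "G"],
    ["A1", "X", "Y", "Z", "U", "J", "K"],
    ["A2", "XX", "YY", "ZZ", "UU", "JJ", "KK"],
    ["A3", "XXX", "YYY", "ZZZ", "UUU", "JJJ", "KKK"],
])

df = apply_header_rows(raw)
print(df)
```

For a sheet with three header rows, calling apply_header_rows(raw, n_header_rows=3) should give a 3-level column index in the same way.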

I use Anaconda on Windows and updated pandas with:

conda update pandas

and now I get the desired result.

Here's my data:

https://i.stack.imgur.com/ho13H.png

and this is the result:

  pandas.read_excel(file, index_col=0, header=[0,1])
Out[7]: 
            MIN                                                            \
             C1       C2           C3       C4       T1       T2       T3   
0    195.207890        -   101.464978  142.434        -        -  943.799   
1   1018.091967  982.585  1008.165221   1089.3  3579.36  2897.13        -   
3    719.242505  768.078   798.991606  979.055   1562.6  1503.61  1635.22   
7    464.714785  115.527   339.229797  68.8829  181.388  552.229   809.36   
8    238.469139  173.027   197.930122        -  633.154  610.908  495.791   
10   384.770673  532.663   230.583377  444.087  2105.43  1109.59  1362.43   
14   420.279847  401.482   323.935379  393.111  1135.14  969.754  1030.53   
15   529.268355  375.933   501.639846  561.166  3001.63   3030.6  2617.78   
21   262.806700  259.203   444.979777        -  1194.33  1260.72  1070.19   
28   280.283310  287.044   275.809979  257.798  622.784  899.187  512.905   

                         COM                ...                               \
             T4           C1           C2   ...              T3           T4   
0   1235.989828   132.088723   127.384065   ...     1647.217230   448.336279   
1   3406.803144  1092.341474  1263.549755   ...     3399.548560  3144.652639   
3   1400.570267   911.110083   754.166616   ...     1651.774770  1690.612134   
7    734.734422   587.381973   568.188789   ...      872.138431   912.248578   
8    417.361810   182.506544   164.936057   ...      765.018292  1070.565315   
10  1148.614845   424.377037   449.054287   ...     1293.657158  1960.196871   
14   947.046046   536.630139   482.741047   ...     1041.749363  1159.747331   
15  2164.517695   558.597139   721.841033   ...     2548.803301  2743.534159   
21  1198.080826   530.759489   663.639841   ...     1372.536515  1296.604595   
28   665.474002   453.753921   236.935726   ...     1001.108816   677.224724   

           UND                                                              \
            C1          C2          C3       C4       T1       T2       T3   
0    17.126508   86.103158   84.637729        -        -    438.8  2004.51   
1   856.588602  696.177886  697.322434  684.055  4238.82  3420.76  2339.89   
3   523.836538  488.007532  804.293445  467.541        -  666.135  594.047   
7   289.235298  272.521236  239.166506  250.247  602.523  449.244  547.401   
8   140.495332   77.390391  114.810278  149.386        -        -        -   
10  220.994094  208.610597  233.131489  223.641  1082.33  1115.45  1040.81   
14  228.683350  250.932989  213.735624  225.627  623.491  598.308  555.539   
15  283.552101  280.930293  293.570089  244.061  791.533  1181.63  1069.91   
21  243.737751  233.957191  200.573198  203.905  795.793  903.329  1029.72   
28  155.424805  236.211838  197.738949  175.754  728.167  687.443  917.051   


         T4  
0   1555.74  
1   3369.02  
3   679.208  
7   207.939  
8         -  
10  1043.51  
14  602.447  
15  844.655  
21  958.073  
28  572.275  

[10 rows x 24 columns]

The problem seems to come from duplicate column names, for example the two (B, C) columns in the sample.
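
A quick way to check whether a header has this kind of duplication is sketched below. The column labels are typed in by hand to mirror the sample sheet, so this only shows how to detect repeated (level 0, level 1) pairs; it does not confirm that they cause the ValueError.

```python
import pandas as pd

# Column labels as they would look after filling the merged top-row cells
# (hand-written here to mirror the sample sheet).
cols = pd.MultiIndex.from_arrays([
    ["B", "B", "D", "D", "F", "F"],
    ["C", "C", "E", "E", "G", "G"],
])

# is_unique is False as soon as any (level 0, level 1) pair repeats.
print(cols.is_unique)

# duplicated() marks every repeat after the first occurrence of a pair.
dup_mask = cols.duplicated()
print(dup_mask)
```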
