Python pandas read_excel: different behavior when parsing a MultiIndex DataFrame between pandas 0.18.1 and pandas > 0.19
I am totally confused. I have probably missed an update to the pandas API.
I have this Excel file.
In pandas 0.18.1 I had no problem reading and parsing the file. I used the following code:
import pandas as pd
fname = 'SAMPLE_EXCEL_CAUSING_ERROR_IN_PANDAS_0_19_UP.xlsx'
pd.read_excel(fname, 'Sheet1', header=[0,1], index=[0,1])
It returned what I wanted.
Recently I updated my packages, and pandas is now at version 0.20.1. However, when I ran the same code on the same Excel file, it raised an error. Here is the error message:
ValueError: Length of new names must be 1, got 2
Any clue as to what I missed in the new read_excel API? Is there a workaround for reading an Excel file with MultiIndex columns? My real data has a three-level index rather than a two-level one. Thanks a lot for any suggestions.
PS: I cannot downgrade to 0.18.1 because my users are on 0.20.1.
UPDATE
Strangely, if I set header=[1,2], no error is raised. However, the wrong levels end up as my index. I am still looking for a workaround.
You could build the column index semi-manually.
Take only the first 2 rows, starting from the second column, and forward-fill the empty cells from the left:
header = pd.read_excel(fname, 'Sheet1', index=[0], header=None).iloc[:2, 1:].ffill(axis=1)
Then skip the first 2 rows and set the first column as the index:
df = pd.read_excel(fname, 'Sheet1', skiprows=[0,1], index=0, header=None).rename(columns={0: 'A'}).set_index('A')
Finally, build the MultiIndex from the header rows and assign it to the columns:
df.columns=pd.MultiIndex.from_arrays(header.values)
df
      B         D         F
      C    C    E    E    G    G
A
A1    X    Y    Z    U    J    K
A2   XX   YY   ZZ   UU   JJ   KK
A3  XXX  YYY  ZZZ  UUU  JJJ  KKK
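Since the question mentions that the real data has three header levels, the same steps can be wrapped in a small helper that takes the number of header rows. This is only a sketch: `split_header` is a hypothetical name, and `raw` is an in-memory stand-in for what reading the sheet with header=None might return, assumed here so the example runs without the Excel file.

```python
import pandas as pd

# Hypothetical stand-in for the sheet read with header=None:
# two header rows (merged cells come back empty), then the data rows.
raw = pd.DataFrame([
    ["A", "B", None, "D", None, "F", None],
    [None, "C", "C", "E", "E", "G", "G"],
    ["A1", "X", "Y", "Z", "U", "J", "K"],
    ["A2", "XX", "YY", "ZZ", "UU", "JJ", "KK"],
    ["A3", "XXX", "YYY", "ZZZ", "UUU", "JJJ", "KKK"],
])


def split_header(raw, n_header_rows):
    """Turn the first n_header_rows rows into MultiIndex columns."""
    # Header cells, excluding the index column, forward-filled left to right
    header = raw.iloc[:n_header_rows, 1:].ffill(axis=1)
    # Remaining rows are data; the first column becomes the row index
    body = raw.iloc[n_header_rows:].set_index(0)
    body.index.name = raw.iat[0, 0]
    # One list per header row gives one level per row
    body.columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())
    return body


df = split_header(raw, 2)
print(df)
```

For a sheet with three header rows, the call would be `split_header(raw, 3)`, assuming the merged cells are laid out the same way.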
I use Anaconda on Windows, and I updated pandas with:
conda update pandas
Now I get the desired result.
Here is my data:
https://i.stack.imgur.com/ho13H.png
And this is the result:
pandas.read_excel(file, index_col=0, header=[0,1])
Out[7]:
MIN \
C1 C2 C3 C4 T1 T2 T3
0 195.207890 - 101.464978 142.434 - - 943.799
1 1018.091967 982.585 1008.165221 1089.3 3579.36 2897.13 -
3 719.242505 768.078 798.991606 979.055 1562.6 1503.61 1635.22
7 464.714785 115.527 339.229797 68.8829 181.388 552.229 809.36
8 238.469139 173.027 197.930122 - 633.154 610.908 495.791
10 384.770673 532.663 230.583377 444.087 2105.43 1109.59 1362.43
14 420.279847 401.482 323.935379 393.111 1135.14 969.754 1030.53
15 529.268355 375.933 501.639846 561.166 3001.63 3030.6 2617.78
21 262.806700 259.203 444.979777 - 1194.33 1260.72 1070.19
28 280.283310 287.044 275.809979 257.798 622.784 899.187 512.905
COM ... \
T4 C1 C2 ... T3 T4
0 1235.989828 132.088723 127.384065 ... 1647.217230 448.336279
1 3406.803144 1092.341474 1263.549755 ... 3399.548560 3144.652639
3 1400.570267 911.110083 754.166616 ... 1651.774770 1690.612134
7 734.734422 587.381973 568.188789 ... 872.138431 912.248578
8 417.361810 182.506544 164.936057 ... 765.018292 1070.565315
10 1148.614845 424.377037 449.054287 ... 1293.657158 1960.196871
14 947.046046 536.630139 482.741047 ... 1041.749363 1159.747331
15 2164.517695 558.597139 721.841033 ... 2548.803301 2743.534159
21 1198.080826 530.759489 663.639841 ... 1372.536515 1296.604595
28 665.474002 453.753921 236.935726 ... 1001.108816 677.224724
UND \
C1 C2 C3 C4 T1 T2 T3
0 17.126508 86.103158 84.637729 - - 438.8 2004.51
1 856.588602 696.177886 697.322434 684.055 4238.82 3420.76 2339.89
3 523.836538 488.007532 804.293445 467.541 - 666.135 594.047
7 289.235298 272.521236 239.166506 250.247 602.523 449.244 547.401
8 140.495332 77.390391 114.810278 149.386 - - -
10 220.994094 208.610597 233.131489 223.641 1082.33 1115.45 1040.81
14 228.683350 250.932989 213.735624 225.627 623.491 598.308 555.539
15 283.552101 280.930293 293.570089 244.061 791.533 1181.63 1069.91
21 243.737751 233.957191 200.573198 203.905 795.793 903.329 1029.72
28 155.424805 236.211838 197.738949 175.754 728.167 687.443 917.051
T4
0 1555.74
1 3369.02
3 679.208
7 207.939
8 -
10 1043.51
14 602.447
15 844.655
21 958.073
28 572.275
[10 rows x 24 columns]
The problem seems to come from duplicate column names, for example the two (B, C) columns in the sample.
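One way to check whether a header contains such repeated labels is to build the MultiIndex by hand and ask pandas for its duplicates. The sketch below uses the column labels from the sample output as assumed data. It only detects duplicates and does not show that they are what triggers the ValueError.

```python
import pandas as pd

# Column labels as they appear in the sample: each (top, bottom) pair occurs twice
columns = pd.MultiIndex.from_arrays([
    ["B", "B", "D", "D", "F", "F"],
    ["C", "C", "E", "E", "G", "G"],
])

# False whenever at least one (level 0, level 1) pair repeats
print(columns.is_unique)

# Every pair that occurs more than once, listed a single time
dupes = columns[columns.duplicated(keep=False)].unique()
print(dupes.tolist())
```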