Pandas read_csv with MultiIndex columns
I have a CSV file that looks like this:
;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
2;0.1479;0.3473;0.1396;0.1298;0.9051;0.7637;0.9413;0.0467;0.9106;0.2931;0.0108;0.0220
3;0.6559;0.3842;0.8389;0.4315;0.2748;0.2193;0.9306;0.6496;0.6549;0.0835;0.8225;0.0136
When I read it with pandas, I get:
df = pd.read_csv(file_path, delimiter=";", header=[0,1,2], index_col=0)
print(df)
a1 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 a2 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0
b1 Unnamed: 2_level_1 b2 Unnamed: 4_level_1 b1 Unnamed: 6_level_1 b2 Unnamed: 8_level_1
c1 c2 c1 c2 c1 c2 c1 c2
0 0.6979 0.1863 0.4639 0.3777 0.7896 0.3321 0.8255 0.1357
1 0.8593 0.4796 0.4800 0.6605 0.3322 0.8397 0.5421 0.5000
2 0.0205 0.0679 0.3378 0.0636 0.9365 0.4386 0.4939 0.9106
3 0.0052 0.2623 0.8616 0.6671 0.6522 0.8673 0.0300 0.6935
How can I make pandas recognize the headers as a MultiIndex and get this output with no unnamed columns?
a1 a2
b1 b2 b1 b2
c1 c2 c1 c2 c1 c2 c1 c2
0 0.6979 0.1863 0.4639 0.3777 0.7896 0.3321 0.8255 0.1357
1 0.8593 0.4796 0.4800 0.6605 0.3322 0.8397 0.5421 0.5000
2 0.0205 0.0679 0.3378 0.0636 0.9365 0.4386 0.4939 0.9106
3 0.0052 0.2623 0.8616 0.6671 0.6522 0.8673 0.0300 0.6935
Thanks guys!
I think any decent solution here will have to make use of pandas.MultiIndex in some way.
What you can do is read the header lines (nrows=3) separately into a DataFrame and convert that to a list of lists, which can be passed to pandas.MultiIndex.from_arrays().
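For reference, here is a minimal sketch of what pandas.MultiIndex.from_arrays() does with a list of lists, independent of any file. The labels are borrowed from the question; the zero values and the name demo are placeholders made up for illustration.

```python
import pandas as pd

# One inner list per header level; position i across the lists describes column i.
arrays = [
    ["a1", "a1", "a1", "a1"],
    ["b1", "b1", "b2", "b2"],
    ["c1", "c2", "c1", "c2"],
]
columns = pd.MultiIndex.from_arrays(arrays)

# Placeholder data, only there so the columns have something to label.
demo = pd.DataFrame([[0.0, 0.0, 0.0, 0.0]], columns=columns)

# Selecting by the outer levels narrows the frame down to the inner level.
print(demo["a1"]["b2"])
```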
The trick is to set the option keep_default_na to False so that the empty header cells are read as empty strings rather than NaN, and so don't show up in the resulting headers.
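import pandas as pd

# Read only the three header rows; keep_default_na=False keeps empty cells as ''.
headers = pd.read_csv(file_path, header=None, nrows=3, delimiter=';',
                      index_col=0, keep_default_na=False).values.tolist()
# Read the data, then replace the auto-generated column labels.
df = pd.read_csv(file_path, delimiter=';', header=[0, 1, 2], index_col=0)
df.columns = pd.MultiIndex.from_arrays(headers)
print(df)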
This gives the desired output:
a1 a2
b1 b2 b1 b2
c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
0 0.9803 0.6223 0.3398 0.1376 0.3197 0.4410 0.9854 0.2557 0.4300 0.2170 0.4303 0.2307
1 0.1125 0.2934 0.8716 0.4591 0.4254 0.1810 0.6816 0.7632 0.7135 0.1945 0.0215 0.1310
2 0.1479 0.3473 0.1396 0.1298 0.9051 0.7637 0.9413 0.0467 0.9106 0.2931 0.0108 0.0220
3 0.6559 0.3842 0.8389 0.4315 0.2748 0.2193 0.9306 0.6496 0.6549 0.0835 0.8225 0.0136
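One thing to be aware of: as far as I can tell, the blank header cells end up as empty-string labels, so the frame prints as desired but a lookup like df["a1"] would only match the first column. If you also want to select by label, a possible variation is to forward-fill the blanks before building the MultiIndex. The sketch below is self-contained: it inlines the first two data rows of the question's CSV, and CSV_TEXT and forward_fill are names I made up. It also reads the data with skiprows=3 instead of header=[0, 1, 2], which is a substitution on my part, not part of the code above.

```python
import io
import pandas as pd

# Header rows plus the first two data rows from the question, inlined for the sketch.
CSV_TEXT = """;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
"""

def forward_fill(labels):
    """Hypothetical helper: replace each '' with the last non-empty label to its left."""
    filled, last = [], ""
    for label in labels:
        if label != "":
            last = label
        filled.append(last)
    return filled

# Header rows only; keep_default_na=False keeps empty cells as '' instead of NaN.
header_rows = pd.read_csv(io.StringIO(CSV_TEXT), header=None, nrows=3,
                          delimiter=";", index_col=0, keep_default_na=False)
arrays = [forward_fill(row) for row in header_rows.values.tolist()]

# Data rows only: skip the three header lines and use the first column as the index.
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None, skiprows=3,
                 delimiter=";", index_col=0)
df.index.name = None
df.columns = pd.MultiIndex.from_arrays(arrays)

# With filled labels, selecting by the outer levels returns all three c columns.
print(df["a1"]["b2"])
```

The printed frame repeats a1, b1 and so on over every column instead of leaving blanks, so which variant is preferable depends on whether you care more about the display or about label-based selection.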
In theory, you could also devise a solution that reads the file only once and then manipulates the headers wherever "Unnamed" appears, but such an approach would be less reliable (you shouldn't assume the header format in general).
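For completeness, a rough sketch of that single-read idea might look like the following. It rests on exactly the assumptions that make it fragile: that pandas labels empty header cells with a prefix of "Unnamed:" (as in the question's output) and that no real label starts with that text. CSV_TEXT and fill_unnamed are made-up names, and the inlined data is the first two rows of the question's CSV. Unlike the code above, this fills each blank with the label to its left rather than leaving it blank.

```python
import io
import pandas as pd

# Header rows plus the first two data rows from the question, inlined for the sketch.
CSV_TEXT = """;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
"""

def fill_unnamed(columns):
    """Hypothetical helper: swap 'Unnamed: ...' labels for the nearest named label to the left."""
    arrays = []
    for level in range(columns.nlevels):
        last, filled = "", []
        for label in columns.get_level_values(level):
            # Assumption: anything starting with 'Unnamed:' was an empty header cell.
            if not str(label).startswith("Unnamed:"):
                last = label
            filled.append(last)
        arrays.append(filled)
    return pd.MultiIndex.from_arrays(arrays, names=columns.names)

# Single pass over the file, then repair the column labels in memory.
df = pd.read_csv(io.StringIO(CSV_TEXT), delimiter=";", header=[0, 1, 2], index_col=0)
df.columns = fill_unnamed(df.columns)
print(df["a2"]["b1"])
```

Note that the fill runs across each level independently, so it only gives sensible results when every group starts with a named cell, as it does in this file.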