Pandas read_csv with MultiIndex columns

I have a CSV file that looks like this:

;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
2;0.1479;0.3473;0.1396;0.1298;0.9051;0.7637;0.9413;0.0467;0.9106;0.2931;0.0108;0.0220
3;0.6559;0.3842;0.8389;0.4315;0.2748;0.2193;0.9306;0.6496;0.6549;0.0835;0.8225;0.0136

When I read it with pandas I get:

df = pd.read_csv(file_path, delimiter=";", header=[0,1,2], index_col=0)

print(df)

       a1 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0      a2 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0
       b1 Unnamed: 2_level_1                 b2 Unnamed: 4_level_1      b1 Unnamed: 6_level_1                 b2 Unnamed: 8_level_1
       c1                 c2                 c1                 c2      c1                 c2                 c1                 c2
0  0.6979             0.1863             0.4639             0.3777  0.7896             0.3321             0.8255             0.1357
1  0.8593             0.4796             0.4800             0.6605  0.3322             0.8397             0.5421             0.5000
2  0.0205             0.0679             0.3378             0.0636  0.9365             0.4386             0.4939             0.9106
3  0.0052             0.2623             0.8616             0.6671  0.6522             0.8673             0.0300             0.6935

How can I make pandas recognize the headers as a MultiIndex and get this output with no unnamed columns?

       a1                                                               a2
       b1                                    b2                         b1                              b2
       c1                 c2                 c1                 c2      c1                 c2                 c1                 c2
0  0.6979             0.1863             0.4639             0.3777  0.7896             0.3321             0.8255             0.1357
1  0.8593             0.4796             0.4800             0.6605  0.3322             0.8397             0.5421             0.5000
2  0.0205             0.0679             0.3378             0.0636  0.9365             0.4386             0.4939             0.9106
3  0.0052             0.2623             0.8616             0.6671  0.6522             0.8673             0.0300             0.6935

Thanks guys!

I think any decent solution here will have to make use of pandas.MultiIndex in some way.

What you can do is read the header lines (nrows=3) separately into a DataFrame and convert that to a list of lists, which can be passed to pandas.MultiIndex.from_arrays().

The trick is to set the option keep_default_na to False so that blank header cells are read as empty strings instead of NaN and therefore don't appear in the resulting headers.

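import pandas as pd

# Read only the three header rows; keep_default_na=False keeps blank cells as ''
headers = pd.read_csv(file_path, header=None, nrows=3, delimiter=';',
                      index_col=0, keep_default_na=False).values.tolist()
# Read the data, then replace the auto-generated "Unnamed" column labels
df = pd.read_csv(file_path, delimiter=';', header=[0, 1, 2], index_col=0)
df.columns = pd.MultiIndex.from_arrays(headers)
print(df)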

This gives the desired output:

       a1                                              a2                                        
       b1                      b2                      b1                      b2                
       c1      c2      c3      c1      c2      c3      c1      c2      c3      c1      c2      c3
0  0.9803  0.6223  0.3398  0.1376  0.3197  0.4410  0.9854  0.2557  0.4300  0.2170  0.4303  0.2307
1  0.1125  0.2934  0.8716  0.4591  0.4254  0.1810  0.6816  0.7632  0.7135  0.1945  0.0215  0.1310
2  0.1479  0.3473  0.1396  0.1298  0.9051  0.7637  0.9413  0.0467  0.9106  0.2931  0.0108  0.0220
3  0.6559  0.3842  0.8389  0.4315  0.2748  0.2193  0.9306  0.6496  0.6549  0.0835  0.8225  0.0136
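
One caveat that may matter: with keep_default_na=False the blank header cells become empty strings, so the frame prints as desired, but the labels under the blanks are '' rather than a1 or b1, and a lookup such as df['a1'] would likely match only the first column. The following is a minimal sketch of a variation, assuming every blank cell is meant to inherit the label to its left. The helper forward_fill and the inline CSV_TEXT (standing in for file_path) are illustrative names, not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the question's file so the sketch is self-contained (assumption:
# in real use you would pass file_path instead of io.StringIO(CSV_TEXT)).
CSV_TEXT = """;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
2;0.1479;0.3473;0.1396;0.1298;0.9051;0.7637;0.9413;0.0467;0.9106;0.2931;0.0108;0.0220
3;0.6559;0.3842;0.8389;0.4315;0.2748;0.2193;0.9306;0.6496;0.6549;0.0835;0.8225;0.0136
"""


def forward_fill(labels):
    """Hypothetical helper: replace each '' with the nearest label to its left."""
    filled, last = [], ""
    for label in labels:
        if label != "":
            last = label
        filled.append(last)
    return filled


# Header rows only; blank cells stay as empty strings.
headers = pd.read_csv(io.StringIO(CSV_TEXT), header=None, nrows=3, delimiter=";",
                      index_col=0, keep_default_na=False).values.tolist()
headers = [forward_fill(row) for row in headers]

# Data rows, with the auto-generated column labels replaced afterwards.
df = pd.read_csv(io.StringIO(CSV_TEXT), delimiter=";", header=[0, 1, 2], index_col=0)
df.columns = pd.MultiIndex.from_arrays(headers)

# Group-level selection now covers all six columns under a1.
print(df["a1"])
```

The printed frame should look the same as before, since pandas hides repeated labels when displaying a MultiIndex; the difference is only in how the columns can be selected.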

In theory, you could also devise a solution that reads the file only once and then patches the headers wherever "Unnamed" appears, but such an approach would be less reliable (you shouldn't assume the header format in general).
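
For illustration only, the single-read idea might look roughly like the sketch below. It rests on two assumptions: that pandas labels blank header cells with names starting with "Unnamed:" (as in the question's output), and that each such cell should inherit the label to its left. The names CSV_TEXT and fill_unnamed are hypothetical, and a genuine column label that happens to start with "Unnamed:" would be overwritten, which is part of why this route is more fragile.

```python
import io

import pandas as pd

# Inline copy of the question's file (assumption: stands in for file_path).
CSV_TEXT = """;a1;;;;;;a2;;;;;
;b1;;;b2;;;b1;;;b2;;
;c1;c2;c3;c1;c2;c3;c1;c2;c3;c1;c2;c3
0;0.9803;0.6223;0.3398;0.1376;0.3197;0.4410;0.9854;0.2557;0.4300;0.2170;0.4303;0.2307
1;0.1125;0.2934;0.8716;0.4591;0.4254;0.1810;0.6816;0.7632;0.7135;0.1945;0.0215;0.1310
2;0.1479;0.3473;0.1396;0.1298;0.9051;0.7637;0.9413;0.0467;0.9106;0.2931;0.0108;0.0220
3;0.6559;0.3842;0.8389;0.4315;0.2748;0.2193;0.9306;0.6496;0.6549;0.0835;0.8225;0.0136
"""


def fill_unnamed(columns):
    """Hypothetical helper: rebuild a MultiIndex, replacing every label that
    starts with 'Unnamed:' by the nearest real label to its left."""
    arrays = []
    for level in range(columns.nlevels):
        filled, last = [], ""
        for label in columns.get_level_values(level):
            if not str(label).startswith("Unnamed:"):
                last = label
            filled.append(last)
        arrays.append(filled)
    return pd.MultiIndex.from_arrays(arrays)


# Single pass over the file; the header labels are patched in memory.
df = pd.read_csv(io.StringIO(CSV_TEXT), delimiter=";", header=[0, 1, 2], index_col=0)
df.columns = fill_unnamed(df.columns)
print(df)
```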
