简体   繁体   中英

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

CSV 标题

What I would like to get is:

所需的熊猫头

When I try to load with pd.read_csv(file, header=[0,1], sep=',') , I end up with the following:

结果不正确

Is there a way to get the desired result?


Note: alternatively, I would accept this as a result:

替代结果


Versions used:

  • Python: 2.7.8
  • Pandas 0.16.0

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

columns = pd.DataFrame(df.columns.tolist())

then rename the Unnamed: columns to NaN :

columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

columns[0] = columns[0].fillna(method='ffill')

so that columns now looks like

In [314]: columns
Out[314]: 
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')

To make the first two columns, A and B , indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:

columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values

Now you can build a new MultiIndex and assign it to df.columns :

df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

then

import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
    print(df)

yields

   A  B  C        D      
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0

There is no magical way of making pandas aware of how you want your index to look, the closest way you can do this is by specifying a lot yourself, like this:

names = ['A', 'B', 
         ('C','X'), ('C', 'Y'), ('C', 'Z'),
         ('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])

Gives:

     C        D      
     X  Y  Z  X  Y  Z
A B                  
1 2  3  4  5  6  7  8

To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.

pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])

Then access the columns and loop to create your header. Again, not a very clean solution, but should work.

you can read using :

    df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)

and then

    df.columns = pd.MultiIndex.from_tuples(df.columns)

Load the dataframe, with multiindex:

df = pd.read_csv(filelist,header=[0,1], sep=',')

Write a function to replace the index:

def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
                l[i][0] = l[i][1]
                l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df

Return the new dataframe properly indexed:

replace_index(df)

I used a technique to flatten from the multi-index columns and make one column. It works well for me.

your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]

Import your csv file providing the header row indexes:

df = pd.read_csv('file.csv', header=[0, 1, 2])

Then, you can iterate over each column header, clean it up, assign it to a tuple, the re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples)

df.columns = pd.MultiIndex.from_tuples(
[tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)

this is the quick one liner I was looking for when trying to figure this out.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM