
Loading pandas table with column names and dtypes

I'm fairly new to pandas and I'm having some trouble loading a table from a text file.

Here's an example of what the data looks like:

#    Header text
#    Header text
# id col1 col2 col3 col4
0 0.44:66 0 1600 45.6e-3
1 0.25:7f 0 1600 52.1e-3
2 0.31:5e 0 1600 33.7e-3
...
2500 0.42.6f 0 1400 42.1e-3
# END
# Footer text

I am reading it in as follows:

import pandas as pd

with open(filename, 'rt') as f:
    df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python')

Then when I print(df.dtypes) I get the following:

# id        int64
col1        object
col2        int64
col3        int64
col4        float64
dtype: object

This is fine, except for the # in the name of the first column. So I tried specifying the names:

df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python', 
                   names=["id", "col1", "col2", "col3", "col4"])

but then print(df.dtypes) gives:

id          object
col1        object
col2        object
col3        object
col4        object
dtype: object

So I tried specifying both names and dtypes:

df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python',
                   names=["id", "col1", "col2", "col3", "col4"],
                   dtype={"id": int, "col1": str, "col2": int, "col3": int, "col4": float})

but this gives an error:

ValueError: Unable to convert column id to type <class 'int'>

What's wrong? How can I load the table with the column names I want and the appropriate dtypes?

I have found a workaround solution but I am open to better solutions if they are out there.

I loaded the table without specifying the names or dtypes and then renamed the problematic column name as:

df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python')
df.rename(columns={'# id':'id'}, inplace=True)

Then I used print(df.dtypes) to get the desired output:

id          int64
col1        object
col2        int64
col3        int64
col4        float64
dtype: object

A few comments.

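Firstly, your columns appear to be separated by whitespace. read_table splits on tabs by default (sep='\t'), so your code only works as shown if the file is actually tab-separated. For space-separated data you would need an explicit separator, such as sep=' ' or sep=r'\s+', in the call to read_table or read_csv.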
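To illustrate the separator point, here is a minimal sketch. It assumes a space-separated copy of the sample data held in an in-memory string (the raw variable and io.StringIO are stand-ins for the real file) and uses sep=r"\s+", which splits on any run of whitespace. The names are passed explicitly and all three comment lines at the top are skipped.

```python
import io
import pandas as pd

# Hypothetical space-separated sample, mirroring the layout in the question.
raw = "\n".join([
    "#    Header text",
    "#    Header text",
    "# id col1 col2 col3 col4",
    "0 0.44:66 0 1600 45.6e-3",
    "1 0.25:7f 0 1600 52.1e-3",
    "2 0.31:5e 0 1600 33.7e-3",
    "# END",
    "# Footer text",
])

df = pd.read_table(
    io.StringIO(raw),
    sep=r"\s+",          # split on runs of whitespace instead of tabs
    skiprows=3,          # two header-text lines plus the "# id ..." line
    skipfooter=2,        # "# END" and "# Footer text"
    engine="python",     # skipfooter requires the python engine
    names=["id", "col1", "col2", "col3", "col4"],
)
print(df.dtypes)
```

If the real file turns out to be tab-separated, the default separator of read_table should already be enough and the sep argument can be dropped.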

Secondly, you don't need to open the file first; you can pass the filename straight to the pandas function: pd.read_table(filename, ...)

But to answer your question:

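If you specify the column names explicitly with names=[...], pandas assumes the file has no header row (it behaves as if header=None), whether or not the names match what is in the file. With skiprows=2 the "# id col1 ..." line is then read as the first row of data, and because it contains strings, every column gets the data type object (i.e. strings). You therefore have to skip an additional row (skiprows=3), or alternatively keep skiprows=2 and pass header=0 so that pandas replaces the existing header with your names.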
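A short sketch of this behaviour, assuming a tab-separated copy of the sample data in an in-memory string (raw and io.StringIO are stand-ins for the real file). It loads the data three ways: with names and skiprows=2 (the header line leaks into the data), with names and skiprows=3, and with names plus header=0.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample, mirroring the layout in the question.
raw = "\n".join([
    "#    Header text",
    "#    Header text",
    "# id\tcol1\tcol2\tcol3\tcol4",
    "0\t0.44:66\t0\t1600\t45.6e-3",
    "1\t0.25:7f\t0\t1600\t52.1e-3",
    "2\t0.31:5e\t0\t1600\t33.7e-3",
    "# END",
    "# Footer text",
])
names = ["id", "col1", "col2", "col3", "col4"]
common = dict(skipfooter=2, engine="python", names=names)

# Problem case: the "# id ..." line is parsed as the first data row.
bad = pd.read_table(io.StringIO(raw), skiprows=2, **common)

# Fix 1: skip the header line as well.
fix_skip = pd.read_table(io.StringIO(raw), skiprows=3, **common)

# Fix 2: declare the line a header so that names replaces it.
fix_header = pd.read_table(io.StringIO(raw), skiprows=2, header=0, **common)

print(bad.head(2))
print(fix_skip.dtypes)
print(fix_header.dtypes)
```

Both fixes should give numeric dtypes for id, col2, col3 and col4. Which one to prefer is mostly a matter of taste: skiprows=3 treats the line as noise, while header=0 documents that the file does contain a header.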

Use astype to convert the column type:

df['id'] = df['id'].astype(int)
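Note that astype(int) only succeeds if every value in the column parses as an integer. A small sketch, assuming the frame was loaded with names=[...] and skiprows=2, so the old header line is still sitting in the first row (the sample data lives in a hypothetical in-memory string):

```python
import io
import pandas as pd

# Hypothetical tab-separated sample, mirroring the layout in the question.
raw = "\n".join([
    "#    Header text",
    "#    Header text",
    "# id\tcol1\tcol2\tcol3\tcol4",
    "0\t0.44:66\t0\t1600\t45.6e-3",
    "1\t0.25:7f\t0\t1600\t52.1e-3",
    "2\t0.31:5e\t0\t1600\t33.7e-3",
    "# END",
    "# Footer text",
])
names = ["id", "col1", "col2", "col3", "col4"]
df = pd.read_table(io.StringIO(raw), skiprows=2, skipfooter=2,
                   engine="python", names=names)

# The first row holds the text "# id", so a direct conversion fails.
try:
    df["id"].astype(int)
    converted_directly = True
except ValueError:
    converted_directly = False

# Drop the stray header row, then convert.
df = df.iloc[1:].reset_index(drop=True)
df["id"] = df["id"].astype(int)
print(df.dtypes)
```

This only repairs the one column; the other numeric columns would need the same treatment, which is why fixing the read call itself is usually less work.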


 