
Set data type for specific column when using read_csv from pandas

I have a large CSV file (~10 GB) with around 4000 columns. I know that most of the data I expect is int8, so I set:

pandas.read_csv('file.dat', sep=',', engine='c', header=None, 
                na_filter=False, dtype=np.int8, low_memory=False)

The thing is, the final column (the 4000th) is int32. Is there a way to tell read_csv to use int8 by default, and int32 for the 4000th column?

Thank you

If you are certain of the number of columns, you could build the dtype dictionary like this:

dtype = dict(zip(range(4000), ['int8'] * 3999 + ['int32']))

Considering that this works:

import io

import pandas as pd

data = '''\
1,2,3
4,5,6'''

# pd.compat.StringIO was removed from pandas; io.StringIO is the standard replacement
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, dtype={0: 'int8', 1: 'int8', 2: 'int32'}, header=None)

print(df.dtypes)

Returns:

0     int8
1     int8
2    int32
dtype: object

From the docs:

dtype : Type name or dict of column -> type, default None

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
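To illustrate the quoted passage, here is a small sketch for a file that does have a header row, so the dtype dictionary is keyed by column name. The column names and values are invented for the example. The 'code' column shows what "preserve and not interpret" means in practice: reading it as str keeps the leading zeros.

```python
import io

import numpy as np
import pandas as pd

# Invented sample data with a header row.
data = "a,b,code\n1.5,2,007\n3.5,4,042\n"

df = pd.read_csv(
    io.StringIO(data),
    # Keys are column names because the file has a header.
    dtype={"a": np.float64, "b": np.int32, "code": str},
)

print(df["code"].tolist())
```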

Since you have no header, the column names are the integer positions in which the columns occur, i.e. the first column is df[0]. To programmatically set the last column to int32, you can read the first line of the file to get the width of the dataframe, then construct a dictionary of the integer types you want, with the column numbers as the keys.

import numpy as np
import pandas as pd

with open('file.dat') as fp:
    width = len(fp.readline().strip().split(','))
    dtypes = {i: np.int8 for i in range(width)}
    # update the last column's dtype
    dtypes[width-1] = np.int32

    # reset the read position of the file pointer
    fp.seek(0)
    df = pd.read_csv(fp, sep=',', engine='c', header=None, 
                     na_filter=False, dtype=dtypes, low_memory=False)
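Splitting the first line on ',' works for plain numeric data but would miscount if a field ever contained a quoted comma. One possible variation, sketched below, lets pandas parse just the first row to get the width. The temporary file and its contents are stand-ins for 'file.dat' and are only there to make the example self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for 'file.dat': a tiny temporary file with made-up values.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("1,2,100000\n3,4,200000\n")
    path = tmp.name

# Parse only the first row to learn how many columns the file has.
width = pd.read_csv(path, header=None, nrows=1).shape[1]

# int8 everywhere, then override the last column with int32.
dtypes = {i: np.int8 for i in range(width)}
dtypes[width - 1] = np.int32

df = pd.read_csv(path, header=None, na_filter=False, dtype=dtypes)
os.remove(path)

print(df.dtypes)
```

The extra one-row read costs very little even for a large file, because nrows=1 stops parsing after the first line.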
