I have a large CSV file (~10 GB) with around 4000 columns. I know that most of the data I expect is int8, so I set:
import pandas
import numpy as np

pandas.read_csv('file.dat', sep=',', engine='c', header=None,
                na_filter=False, dtype=np.int8, low_memory=False)
The thing is, the final column (the 4000th) is int32. Is there a way to tell read_csv to use int8 by default, but int32 for that last column?
Thank you
If you are certain of the number of columns, you can create the dtype dictionary like this:
dtype = dict(zip(range(4000),['int8' for _ in range(3999)] + ['int32']))
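For example, that dict can be plugged straight into the call from the question (a sketch, assuming the file really does have 4000 comma-separated columns):

import numpy as np
import pandas as pd

# 3999 int8 columns followed by a single int32 column, keyed by position
dtype = dict(zip(range(4000), ['int8'] * 3999 + ['int32']))

df = pd.read_csv('file.dat', sep=',', engine='c', header=None,
                 na_filter=False, dtype=dtype, low_memory=False)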
This works because read_csv accepts per-column dtypes keyed by column position, as this small example shows:
import io
import pandas as pd

data = '''\
1,2,3
4,5,6'''
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, dtype={0: 'int8', 1: 'int8', 2: 'int32'}, header=None)
print(df.dtypes)
Returns:
0     int8
1     int8
2    int32
dtype: object
From the docs:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. Eg {'a': np.float64, 'b': np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
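If you want to confirm that the narrower dtypes actually took effect, and see how much memory they save on a big frame, a quick check on the df from above is:

print(df.dtypes)
print(df.memory_usage(deep=True).sum(), 'bytes')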
Since you have no header, the column names are simply the integer positions in which the columns occur, i.e. the first column is df[0]. To programmatically set the last column to int32, you can read the first line of the file to get the width of the dataframe, then build a dictionary of the integer types you want, keyed by column number.
import numpy as np
import pandas as pd

with open('file.dat') as fp:
    width = len(fp.readline().strip().split(','))
    dtypes = {i: np.int8 for i in range(width)}
    # update the last column's dtype
    dtypes[width-1] = np.int32
    # reset the read position of the file pointer
    fp.seek(0)
    df = pd.read_csv(fp, sep=',', engine='c', header=None,
                     na_filter=False, dtype=dtypes, low_memory=False)
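A minor variation, if you'd rather let pandas work out the column count than split the first line yourself (just a sketch of the same idea):

import numpy as np
import pandas as pd

# read only the first row to discover how many columns there are
ncols = pd.read_csv('file.dat', sep=',', header=None, nrows=1).shape[1]

dtypes = {i: np.int8 for i in range(ncols - 1)}
dtypes[ncols - 1] = np.int32

df = pd.read_csv('file.dat', sep=',', engine='c', header=None,
                 na_filter=False, dtype=dtypes, low_memory=False)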