I'm attempting to load several fairly large CSVs (roughly 30M rows / 7GB in total). Some of the columns are mixed ints and floats - I want these columns as np.float16. Ideally, the dtype parameter of read_csv would be used to make the whole import more efficient, but an error is thrown for these mixed-type columns. Here is the code and the corresponding error:
import numpy as np
import pandas as pd

def import_processing(filepath, cols, null_cols):
    result = pd.read_csv(filepath, header = None, names = cols.keys(), dtype = cols)
    result.drop(null_cols, axis = 1, inplace = True)
    return result
data_cols = { 'feature_0' : np.float32,
'feature_1' : np.float32,
'feature_2' : np.uint32,
'feature_3' : np.uint64,
'feature_4' : np.uint64,
'feature_5' : np.float16,
'feature_6' : np.float16,
'feature_7' : np.float16,
'feature_8' : np.float16,
'feature_9' : np.float16,
'feature_10' : np.float16,
'feature_11' : np.float16,
'feature_12' : np.float16,
'feature_13' : np.float16,
'feature_14' : np.float16,
'feature_15' : np.float16,
'feature_16' : np.float16,
'feature_17' : np.float16,
'feature_18' : np.float16,
'feature_19' : np.float16,
'feature_20' : np.float16,
'feature_21' : np.float16,
'feature_22' : np.float16,
'feature_23' : np.float16,
'feature_24' : np.float16,
'feature_25' : 'M8[ns]',
'feature_26' : 'M8[ns]',
'feature_27' : np.uint64,
'feature_28' : np.uint32,
'feature_29' : np.uint64,
'feature_30' : np.uint32}
files = ['./file_0.csv', './file_1.csv', './file_2.csv']
all_data = [import_processing(f, data_cols, ['feature_0', 'feature_1']) for f in files]
TypeError: Cannot cast array from dtype('O') to dtype('float16') according to the rule 'safe'
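The error comes from NumPy's casting rules rather than from read_csv itself: when the parser can't produce the requested dtype directly, pandas is left with an object-dtype array and then attempts the cast under NumPy's 'safe' rule, which forbids object-to-float16 conversion. A minimal standalone sketch of the failing cast (plain NumPy, not the original files):

```python
import numpy as np

# A column of mixed ints/floats held as object dtype, which is what
# pandas ends up with when it can't parse directly to the target dtype.
mixed = np.array([1, 2.5, 3], dtype=object)

# An unrestricted cast works fine.
print(mixed.astype(np.float16))

# Under the 'safe' casting rule, the same conversion is rejected with
# the TypeError quoted above.
try:
    mixed.astype(np.float16, casting='safe')
except TypeError as e:
    print(e)
```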
But if I don't use the dtype parameter, importing is greatly slowed down, as all the mixed-type columns are imported as dtype('O') instead of np.float16.
I've been getting around this by first applying pd.to_numeric (I'm not sure why this doesn't throw the same error), which converts all columns to np.float64, and then using an astype() conversion to get each column into the type I want (including the mixed-type columns to np.float16). This process is very slow, so I was wondering whether there is a better way of doing it. Currently, my (very slow) working function looks like this:
def import_processing(filepath, cols, null_cols):
    result = pd.read_csv(filepath, header = None, names = cols.keys())
    result.drop(null_cols, axis = 1, inplace = True)
    for c in null_cols:
        cols.pop(c, None)
    result[result.columns] = result[result.columns].apply(pd.to_numeric, errors='coerce')
    result = result.astype(cols)
    return result
Edit: I have read that Dask is generally a much more efficient way of managing large datasets in Python. I've never worked with it before, and as far as I'm aware it implements many operations through calls to Pandas - so I imagine it would hit the same datatype issue.
From the error, my guess is that one of your columns is not strictly numeric and contains some text, so Pandas interprets it as an object-dtype column and is unable to coerce that data to float16. That's just a guess, though.
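If the columns really are just mixed ints and floats, one workaround worth trying (a sketch with made-up two-column data, not the original files) is to request np.float32 in the dtype mapping - read_csv parses mixed int/float columns to float32 directly - and then downcast to np.float16 with a single vectorised astype, avoiding the column-by-column pd.to_numeric pass:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real files: 'mixed' contains
# both ints and floats, 'counts' is a plain integer column.
csv_data = io.StringIO("1,10\n2.5,20\n3,30\n")

# Parse the mixed column as float32 (which read_csv handles directly),
# keeping exact dtypes for the columns that parse cleanly.
df = pd.read_csv(csv_data, header=None, names=['mixed', 'counts'],
                 dtype={'mixed': np.float32, 'counts': np.uint32})

# One vectorised downcast to the target dtype.
df['mixed'] = df['mixed'].astype(np.float16)

print(df.dtypes.to_dict())
```

Note that float16 keeps only about three decimal digits of precision, so it's worth checking that the downcast doesn't lose information you care about.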