
Pandas - Specifying dtype with mixed column data using read_csv

I'm attempting to load several fairly large CSVs (roughly 30M rows / 7GB in total). Some of the columns are mixed ints and floats - I want these columns as np.float16 .


Ideally, I'd use the dtype parameter of read_csv to make the whole import more efficient. But an error is thrown for these mixed-data columns.

Here is the code, and corresponding error:

def import_processing(filepath, cols, null_cols):
    result = pd.read_csv(filepath, header = None, names = cols.keys(), dtype = cols)
    result.drop(null_cols, axis = 1, inplace = True)
    return result

data_cols = { 'feature_0' : np.float32,
              'feature_1' : np.float32,
              'feature_2' : np.uint32,
              'feature_3' : np.uint64,
              'feature_4' : np.uint64,
              'feature_5' : np.float16,
              'feature_6' : np.float16,
              'feature_7' : np.float16,
              'feature_8' : np.float16,
              'feature_9' : np.float16,
              'feature_10' : np.float16,
              'feature_11' : np.float16,
              'feature_12' : np.float16,
              'feature_13' : np.float16,
              'feature_14' : np.float16,
              'feature_15' : np.float16,
              'feature_16' : np.float16,
              'feature_17' : np.float16,
              'feature_18' : np.float16,
              'feature_19' : np.float16,
              'feature_20' : np.float16,
              'feature_21' : np.float16,
              'feature_22' : np.float16,
              'feature_23' : np.float16,
              'feature_24' : np.float16,
              'feature_25' : 'M8[ns]',
              'feature_26' : 'M8[ns]',
              'feature_27' : np.uint64,
              'feature_28' : np.uint32,
              'feature_29' : np.uint64,
              'feature_30' : np.uint32}

files = ['./file_0.csv', './file_1.csv', './file_2.csv']
all_data = [import_processing(f, data_cols, ['feature_0', 'feature_1']) for f in files]

TypeError: Cannot cast array from dtype('O') to dtype('float16') according to the rule 'safe'

But if I don't use the dtype parameter, importing slows down greatly, as all the mixed-datatype columns are imported as dtype('O') instead of np.float16 .

I've been getting around this by first applying pd.to_numeric (I'm not sure why this doesn't throw the same error), which converts all columns to np.float64, and then using an astype() conversion to get each column into the type I want (including the mixed-datatype columns to np.float16 ).

This process is very slow, so I was wondering if there was a better way of doing it. Currently, my (very slow) working function looks like this:

def import_processing(filepath, cols, null_cols):
    result = pd.read_csv(filepath, header = None, names = cols.keys())
    result.drop(null_cols, axis = 1, inplace = True)

    for c in null_cols:
        cols.pop(c, None)

    result[result.columns] = result[result.columns].apply(pd.to_numeric, errors='coerce')
    result = result.astype(cols)
    return result
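One way to keep the single-pass parse while avoiding the two-step conversion above might be to declare the mixed columns as np.float32 in the dtype dict (a width the CSV parser handles directly) and downcast to np.float16 afterwards in memory. A minimal sketch with made-up data, assuming only two columns for brevity:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the CSV files: the second column mixes
# ints and floats, which is what triggers the object-dtype fallback.
csv_data = io.StringIO("1,2.5\n3,4\n")

# Parse-time dtypes use float32 (assumption: the parser accepts this for
# mixed int/float text), then a cheap astype downcasts to the final widths.
read_dtypes = {'feature_4': np.uint64, 'feature_5': np.float32}
target_dtypes = {'feature_4': np.uint64, 'feature_5': np.float16}

df = pd.read_csv(csv_data, header=None, names=read_dtypes.keys(),
                 dtype=read_dtypes)
df = df.astype(target_dtypes)  # in-memory downcast, no re-parsing of text
```

This avoids materialising the columns as Python objects, so the only extra cost over a direct float16 parse is the float32-to-float16 cast.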

Edit: I have read that Dask is, in general, a much more efficient way of managing large datasets in Python. I've never worked with it before, but as far as I'm aware it implements many operations via calls to Pandas - so I imagine it would have the same datatype issues.

From the error, my guess is that one of your columns is not strictly numeric and contains some text, so Pandas reads it as an object-dtype column and is then unable to coerce that data to float16. That's just a guess, though.
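To illustrate the guess: a single non-numeric entry is enough to force an object dtype that a "safe" cast cannot convert, while pd.to_numeric with errors='coerce' sidesteps the problem by turning unparseable values into NaN. A small sketch with invented values:

```python
import numpy as np
import pandas as pd

# One stray non-numeric token forces the whole column to object dtype.
s = pd.Series(['1.5', '2', 'N/A'])  # dtype: object

# A direct s.astype(np.float16) would raise here, because 'N/A' cannot be
# parsed as a number. pd.to_numeric(..., errors='coerce') replaces such
# values with NaN instead of raising, which is why the two-step route works:
clean = pd.to_numeric(s, errors='coerce').astype(np.float16)
```

That matches the behaviour described in the question: the coerced intermediate is float64, and the final astype brings it down to float16.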
