
Pandas performance: Multiple dtypes in one column or split into different dtypes?

I work with huge pandas DataFrames: 20 million rows, 30 columns. The rows hold a lot of data, and each row has a "type" that uses only certain columns. Because of this, I've currently designed the DataFrame so that some columns hold mixed dtypes, depending on the row's "type".

My question is: performance-wise, should I split the mixed-dtype columns into two separate columns or keep them as one? I'm having trouble even saving some of these DataFrames (to_pickle), and I'm trying to be as efficient as possible.

As currently constructed, the columns could be mixes of float/str, float/int, or float/int/str.
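To see what such a column actually holds, something like the following sketch may help. The data and the names `mixed`, `type_counts`, and `numeric_mix` are made up for illustration and are not from the original DataFrame.

```python
import pandas as pd

# Hypothetical column mixing floats and strings; pandas stores it as object dtype
mixed = pd.Series([1.5, "abc", 2.0, "def"])
print(mixed.dtype)

# Count the Python types actually stored in the column, by type name
type_counts = mixed.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# A float/int mix built from plain numbers is upcast to float64 instead
numeric_mix = pd.Series([1, 2.5, 3])
print(numeric_mix.dtype)
```

A float/int column that still reports `object` was probably filled with Python objects one value at a time, so checking the stored types first shows which columns really need splitting.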

It seems to me that this depends on your subsequent use case. But IMHO I would give each column a single type; otherwise groupby with totals and other common pandas functions simply won't work on the mixed columns.
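A minimal sketch of that split, assuming a float/str column. The column names `value`, `value_num`, and `value_str` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical frame: "value" mixes floats and strings, so it is object dtype
df = pd.DataFrame({
    "type": ["a", "b", "a", "b"],
    "value": [1.5, "x", 2.5, "y"],
})

# Numeric part: entries that cannot be parsed as numbers become NaN
df["value_num"] = pd.to_numeric(df["value"], errors="coerce")

# String part: keep only entries that are Python strings, NaN elsewhere
is_str = df["value"].map(lambda v: isinstance(v, str))
df["value_str"] = df["value"].where(is_str)

# Drop the original mixed column once both parts are extracted
df = df.drop(columns="value")

# Aggregations now run on a proper float64 column
totals = df.groupby("type")["value_num"].sum()
print(totals)
```

Note that `pd.to_numeric` will also convert strings that look like numbers (for example `"3"`), so if such strings must stay strings, the numeric mask would need to be built from the stored types instead.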

