Pandas performance: Multiple dtypes in one column or split into different dtypes?

Original post: 2014-05-21 13:25:16 | Tags: python, pandas

I work with huge pandas DataFrames: 20 million rows, 30 columns. The rows hold a lot of data, and each row has a "type" that uses certain columns. Because of this, I've currently designed the DataFrame so that some columns hold mixed dtypes, depending on which "type" the row is.

My question is, performance-wise, should I split the mixed-dtype columns into separate columns or keep them as one? I'm running into problems getting some of these DataFrames to even save (to_pickle), and I'm trying to be as efficient as possible.

As currently constructed, the columns could be mixes of float/str, float/int, or float/int/str.

1 Answer

It seems to me that this may depend on your subsequent use case. But IMHO I would give each column a single dtype; otherwise operations such as groupby with totals and other common pandas functions simply won't work.
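A rough sketch of that suggestion, assuming a hypothetical `type` column that determines whether a row's `value` is numeric or text. The column names and data are invented for illustration, and this is only one possible way to do the split.

```python
import pandas as pd

# Hypothetical frame: rows of type "a" store floats in `value`,
# rows of type "b" store strings, so `value` has object dtype.
df = pd.DataFrame({
    "type": ["a", "a", "b", "b"],
    "value": [1.5, 2.5, "x", "y"],
})

# Split the mixed column into one numeric and one string column,
# leaving NaN where the row's type does not use that column.
is_num = df["type"] == "a"
df["value_num"] = pd.to_numeric(df["value"].where(is_num), errors="coerce")
df["value_str"] = df["value"].where(~is_num)
df = df.drop(columns="value")

print(df.dtypes)

# Numeric aggregation now runs on a float64 column.
totals = df.groupby("type")["value_num"].sum()
print(totals)
```

The trade-off is more columns with NaN gaps in exchange for a homogeneous numeric column that pandas can store and aggregate as a typed array.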