
Improve Pandas performance for very large dataframes?

I have a few Pandas dataframes with several million rows each. The dataframes have columns containing JSON objects, each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes; each processes the JSON (for example, computing a string distance between two fields in the JSON) and returns a JSON with some new fields added. After all 24 functions have executed, I get a final JSON that is usable for my purposes.

I am wondering what the best ways are to speed up performance on this dataset. A few things I have considered and read up on:

  • It is tricky to vectorize, because many operations are not as straightforward as "subtract this column's values from another column's values".
  • I read some of the Pandas documentation, and the options it suggests are Cython (it may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is said to be best suited to numerical computation only).
  • Controlling the number of threads could be an option. The 24 functions can mostly run without any dependencies on each other.

You are asking for advice, and this is not the best site for general advice, but I will nevertheless try to point a few things out.

  1. The ideas you have already considered are not going to help: neither Cython, Numba, nor threading addresses the main problem, which is that your data is stored in a format that is not conducive to fast operations on it.

  2. I suggest that you first "unpack" the JSONs stored in the dataframe column(s?). Preferably, each field of the JSON (mandatory or optional; deal with empty values at this stage) ends up as a column of the dataframe. If there are nested dictionaries, you may want to consider splitting the dataframe (particularly if the 24 functions work separately on separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
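
     A minimal sketch of this unpacking using `pd.json_normalize` (the field names here are invented for illustration; your JSONs will differ):

     ```python
     import json

     import pandas as pd

     # toy stand-in for a dataframe column holding JSON strings
     df = pd.DataFrame({"payload": [
         '{"id": 1, "name": {"first": "ada", "last": "lovelace"}}',
         '{"id": 2, "name": {"first": "grace", "last": "hopper"}}',
     ]})

     # parse each string once, then flatten nested dicts into dotted column names
     flat = pd.json_normalize(df["payload"].map(json.loads))
     # flat now has columns: id, name.first, name.last
     ```

     After this step each JSON field is an ordinary dataframe column, which is what the remaining steps build on.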

  3. Convert to the data format that gives you the best performance. JSON stores all data as text, while numbers are handled fastest in their binary format. You can do this column by column on the columns you suspect should be converted, using df['col'].astype(...) (it works on a whole dataframe too).
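
     For example (the column names are hypothetical; the point is that columns arriving as text become binary numeric dtypes):

     ```python
     import pandas as pd

     # values parsed out of JSON often arrive as strings
     df = pd.DataFrame({"price": ["19.99", "5.00"], "qty": ["3", "7"]})

     # convert per column; astype accepts a dict mapping column -> dtype
     df = df.astype({"price": "float64", "qty": "int64"})

     # numeric dtypes enable fast vectorized arithmetic
     total = (df["price"] * df["qty"]).sum()
     ```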

  4. Update the 24 functions to operate not on JSON strings stored in the dataframe but on the columns of the dataframe.
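
     A sketch of what such a rewrite might look like; `name_length_diff` and the column names are invented stand-ins for one of your 24 functions, not your actual code:

     ```python
     import pandas as pd

     # before: the function received a JSON string and parsed it per row
     # after: the function receives whole columns and uses vectorized string ops
     def name_length_diff(first: pd.Series, last: pd.Series) -> pd.Series:
         return (first.str.len() - last.str.len()).abs()

     df = pd.DataFrame({"first": ["ada", "grace"], "last": ["lovelace", "hopper"]})
     df["len_diff"] = name_length_diff(df["first"], df["last"])
     ```

     A function that needs a per-row external library call (such as your edit-distance package) can still take columns as input and fall back to `Series.map` internally, so only that one step pays the per-row cost.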

  5. Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers back to strings occurs.
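
     One way to recombine, again sketched with invented column names; `Series.to_json` serializes each row, which is where numbers become text again:

     ```python
     import json

     import pandas as pd

     df = pd.DataFrame({"id": [1, 2], "name.first": ["ada", "grace"]})

     # serialize each row back to a single JSON string
     df["payload"] = df.apply(lambda row: row.to_json(), axis=1)
     ```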

Given the level of detail provided in the question, these suggestions are necessarily brief. Should you have more detailed questions on any of the points above, it would be best to ask a maximally simple question about each of them (preferably containing a self-sufficient MWE).
