How can I handle large data in memory using python?

I have a data set that is larger than my memory. In general, I have to loop through 350 points, and each point is a data set of about 80 GB in size. Usually I get around this by dealing with one file at a time, but now I'm performing a computation that requires me to load all the data at once. I'm looking for suggestions on how to tackle this problem. I've already been reading a bit about dask and pyspark, but I'm not sure they are what I need. I can't divide my data into chunks, because I'm performing a PCA (principal component analysis) and need to run the calculation over the whole data set; the data are velocity fields, not tables. Perhaps changing the float format of the arrays in memory could work, or some other trick to compress the arrays in memory. The files at each point are in pickle format; there are 3200 files in total, for about 32 TB of data.
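For the float-format idea, here is a minimal sketch, assuming each pickled field is a float64 NumPy array (the file name is a placeholder). Downcasting to float32 halves the in-memory footprint, and float16 quarters it at some cost in precision; note that with ~32 TB of data, downcasting alone cannot fit everything in 64 GB of RAM, so it only helps in combination with streaming:

import pickle
import numpy as np

with open("point_000.pkl", "rb") as f:  # placeholder file name
    field = pickle.load(f)

field32 = np.asarray(field, dtype=np.float32)  # half the memory of float64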

I have 64 GB of RAM and a CPU with 32 cores.

Any guidance on this issue is very much appreciated.

In general, you can use data generators for this. They allow you to consume a dataset without loading the complete dataset into memory.
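As a plain-Python illustration (a sketch, assuming each pickle file holds one velocity-field array; iter_fields is a placeholder name), a generator keeps only the current file in memory:

import pickle

def iter_fields(paths):
    # Yield one velocity field at a time; only the current file is loaded.
    for path in paths:
        with open(path, "rb") as f:
            yield pickle.load(f)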

In practice you can use TensorFlow. For the data generator, use:

tf.data.Dataset.from_generator

( https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator )
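A minimal sketch of wrapping such a generator into a tf.data pipeline, assuming the fields can be reshaped to a fixed number of columns (the glob pattern, N_FEATURES, and the reshape are placeholders for whatever your velocity fields actually look like):

import glob
import numpy as np
import tensorflow as tf

paths = sorted(glob.glob("data/*.pkl"))  # placeholder pattern for the 3200 files
N_FEATURES = 128                         # placeholder: flattened size of one sample

def batches():
    for field in iter_fields(paths):  # generator from the sketch above
        yield field.reshape(-1, N_FEATURES).astype(np.float32)

dataset = tf.data.Dataset.from_generator(
    batches,
    output_signature=tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
)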

And to apply PCA: tft.pca ( https://www.tensorflow.org/tfx/transform/api_docs/python/tft/pca )
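tft.pca is a tf.Transform analyzer, so it makes a full pass over the dataset as a stream (e.g. when run through an Apache Beam pipeline) rather than loading it all at once. A minimal sketch of the preprocessing function, where the feature name "velocity" and the component count are placeholders:

import tensorflow as tf
import tensorflow_transform as tft

N_COMPONENTS = 10  # placeholder: number of principal components to keep

def preprocessing_fn(inputs):
    x = inputs["velocity"]  # placeholder feature, shape (batch, n_features)
    # tft.pca analyzes the whole dataset and returns the projection
    # matrix of shape (n_features, N_COMPONENTS).
    pca_matrix = tft.pca(x, output_dim=N_COMPONENTS, dtype=tf.float32)
    return {"projected": tf.matmul(x, pca_matrix)}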
