How can I handle large data in memory using python?

I have a data set that is larger than my memory. In general, I have to loop through 350 points, and each point is a data set of about 80 GB in size. Usually I get around this by dealing with one file at a time, but now I'm performing a computation that requires me to load all the data at once. I'm looking for suggestions on how to tackle this problem. I've already been reading a bit about dask and pyspark, but I'm not sure they are what I need. I can't divide my data into chunks, because I'm performing a PCA (principal component analysis) and need to run the calculation over the whole data set; the data are velocity fields, not tables. Perhaps changing the float format of the arrays in memory could work, or some other trick to compress the arrays in memory. The files at each point are in pickle format; there are 3200 files in total, for about 32 TB of data.
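For the float-format idea, here is a minimal sketch, assuming each pickled field is a float64 NumPy array (the file name is a placeholder). Downcasting to float32 halves the in-memory footprint, and float16 quarters it at some cost in precision; note that with ~32 TB of data, downcasting alone cannot fit everything in 64 GB of RAM, so it only helps in combination with streaming:

import pickle
import numpy as np

with open("point_000.pkl", "rb") as f:  # placeholder file name
    field = pickle.load(f)

field32 = np.asarray(field, dtype=np.float32)  # half the memory of float64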

I have 64 GB of RAM and a CPU with 32 cores.

Any guidance on this issue is very much appreciated.

In general, you can use data generators for this. They allow you to consume a dataset without loading the complete dataset into memory.
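As a plain-Python illustration (a sketch, assuming each pickle file holds one velocity-field array; iter_fields is a placeholder name), a generator keeps only the current file in memory:

import pickle

def iter_fields(paths):
    # Yield one velocity field at a time; only the current file is loaded.
    for path in paths:
        with open(path, "rb") as f:
            yield pickle.load(f)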

In practice you can use TensorFlow. For the data generator, use:

tf.data.Dataset.from_generator

( https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator )
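A minimal sketch of wrapping such a generator into a tf.data pipeline, assuming the fields can be reshaped to a fixed number of columns (the glob pattern, N_FEATURES, and the reshape are placeholders for whatever your velocity fields actually look like):

import glob
import numpy as np
import tensorflow as tf

paths = sorted(glob.glob("data/*.pkl"))  # placeholder pattern for the 3200 files
N_FEATURES = 128                         # placeholder: flattened size of one sample

def batches():
    for field in iter_fields(paths):  # generator from the sketch above
        yield field.reshape(-1, N_FEATURES).astype(np.float32)

dataset = tf.data.Dataset.from_generator(
    batches,
    output_signature=tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
)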

And to apply PCA: tft.pca ( https://www.tensorflow.org/tfx/transform/api_docs/python/tft/pca )
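tft.pca is a tf.Transform analyzer, so it makes a full pass over the dataset as a stream (e.g. when run through an Apache Beam pipeline) rather than loading it all at once. A minimal sketch of the preprocessing function, where the feature name "velocity" and the component count are placeholders:

import tensorflow as tf
import tensorflow_transform as tft

N_COMPONENTS = 10  # placeholder: number of principal components to keep

def preprocessing_fn(inputs):
    x = inputs["velocity"]  # placeholder feature, shape (batch, n_features)
    # tft.pca analyzes the whole dataset and returns the projection
    # matrix of shape (n_features, N_COMPONENTS).
    pca_matrix = tft.pca(x, output_dim=N_COMPONENTS, dtype=tf.float32)
    return {"projected": tf.matmul(x, pca_matrix)}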
