
How do I make large datasets load quickly in Python?

I do data mining research and often have Python scripts that load large datasets from SQLite databases, CSV files, pickle files, etc. In the development process, my scripts often need to be changed, and I find myself waiting 20 to 30 seconds for the data to load.

Loading data as streams (e.g. from a SQLite database) sometimes works, but not in all situations -- if I need to go back into a dataset often, I'd rather pay the upfront time cost of loading the data.

My best solution so far is subsampling the data until I'm happy with my final script. Does anyone have a better solution/design practice?

My "ideal" solution would involve using the Python debugger (pdb) cleverly so that the data remains loaded in memory; I could then edit my script and resume from a given point.

One way to do this would be to keep your loading and manipulation scripts in separate files, X and Y, and have X.py read:

import Y                 # Y.py contains only the slow loading code
data = Y.load()
# ... your code ...

When you're coding X.py, you omit this part from the file and run it manually in an interactive shell. Then you can modify X.py and do an import X in the shell to test your code.
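
One way this can look in practice (a sketch, not necessarily the exact workflow meant above): note that repeating import X in the same session will not pick up edits once X has already been imported, so importlib.reload is used below. X.run() is a hypothetical entry point in X.py.

>>> import Y                   # Y.py holds only the slow loading code
>>> data = Y.load()            # pay the loading cost once per session
>>> import X                   # X.py holds the code you are iterating on
>>> X.run(data)                # hypothetical entry point in X.py
>>> # ... edit X.py in your editor ...
>>> import importlib
>>> importlib.reload(X)        # pick up the edits; `data` is still in memory
>>> X.run(data)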

Write a script that does the selects, the object-relational conversions, then pickles the data to a local file. Your development script will start by unpickling the data and proceeding.
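
A minimal sketch of that split, assuming a hypothetical build_dataset() stands in for the selects and conversions, and the cache file name is arbitrary:

import os
import pickle

CACHE = "dataset.pkl"

def build_dataset():
    # stand-in for the expensive part: SQL selects, object-relational conversion, etc.
    return {"rows": list(range(1_000_000))}

if os.path.exists(CACHE):
    # development runs start here and skip the slow path entirely
    with open(CACHE, "rb") as f:
        data = pickle.load(f)
else:
    data = build_dataset()
    with open(CACHE, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)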

If the data is significantly smaller than physical RAM, you can memory map a file shared between two processes, and write the pickled data to memory.
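
A rough sketch of that idea (the file name and dataset are placeholders; the reader part would normally run in a separate process):

import mmap
import pickle

# Writer: pickle the dataset into a file and map it, so the pages sit in
# the OS cache where other processes can read them cheaply.
data = {"rows": list(range(1_000_000))}          # stand-in for the real dataset
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
with open("dataset.cache", "w+b") as f:
    f.truncate(len(blob))                        # size the file before mapping
    with mmap.mmap(f.fileno(), len(blob)) as mm:
        mm.write(blob)

# Reader (e.g. your development script): map the same file and unpickle
# straight from the mapped buffer.
with open("dataset.cache", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data = pickle.loads(mm)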

A Jupyter notebook allows you to load a large data set into a memory-resident data structure, such as a Pandas DataFrame, in one cell. Then you can operate on that data structure in subsequent cells without having to reload the data.
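
For example (a sketch; the file name and column are placeholders):

# cell 1 -- run once per kernel session; the DataFrame stays in memory
import pandas as pd
df = pd.read_csv("big_dataset.csv")

# cell 2 -- re-run freely while you iterate; the data is never reloaded
summary = df.groupby("category").size()
print(summary)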
