
How do I make large datasets load quickly in Python?

I do data mining research and often have Python scripts that load large datasets from SQLite databases, CSV files, pickle files, etc. In the development process, my scripts often need to be changed, and I find myself waiting 20 to 30 seconds for the data to load.

Loading data as streams (e.g. from a SQLite database) sometimes works, but not in all situations -- if I need to go back into a dataset often, I'd rather pay the upfront time cost of loading the data.

My best solution so far is subsampling the data until I'm happy with my final script. Does anyone have a better solution/design practice?

My "ideal" solution would involve using the Python debugger (pdb) cleverly so that the data remains loaded in memory; I could then edit my script and resume from a given point.

One way to do this would be to keep your loading and manipulation scripts in separate files, X and Y, and have X.py read:

import Y                 # Y.py contains only the slow loading code
data = Y.load()
# ... your code ...

When you're coding X.py, you omit this part from the file and run it manually in an interactive shell. Then you can modify X.py and do an import X in the shell to test your code.
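
One way this can look in practice (a sketch, not necessarily the exact workflow meant above): note that repeating import X in the same session will not pick up edits once X has already been imported, so importlib.reload is used below. X.run() is a hypothetical entry point in X.py.

>>> import Y                   # Y.py holds only the slow loading code
>>> data = Y.load()            # pay the loading cost once per session
>>> import X                   # X.py holds the code you are iterating on
>>> X.run(data)                # hypothetical entry point in X.py
>>> # ... edit X.py in your editor ...
>>> import importlib
>>> importlib.reload(X)        # pick up the edits; `data` is still in memory
>>> X.run(data)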

Write a script that does the selects, the object-relational conversions, then pickles the data to a local file. Your development script will start by unpickling the data and proceeding.
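
A minimal sketch of that split, assuming a hypothetical build_dataset() stands in for the selects and conversions, and the cache file name is arbitrary:

import os
import pickle

CACHE = "dataset.pkl"

def build_dataset():
    # stand-in for the expensive part: SQL selects, object-relational conversion, etc.
    return {"rows": list(range(1_000_000))}

if os.path.exists(CACHE):
    # development runs start here and skip the slow path entirely
    with open(CACHE, "rb") as f:
        data = pickle.load(f)
else:
    data = build_dataset()
    with open(CACHE, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)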

If the data is significantly smaller than physical RAM, you can memory map a file shared between two processes, and write the pickled data to memory.
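
A rough sketch of that idea (the file name and dataset are placeholders; the reader part would normally run in a separate process):

import mmap
import pickle

# Writer: pickle the dataset into a file and map it, so the pages sit in
# the OS cache where other processes can read them cheaply.
data = {"rows": list(range(1_000_000))}          # stand-in for the real dataset
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
with open("dataset.cache", "w+b") as f:
    f.truncate(len(blob))                        # size the file before mapping
    with mmap.mmap(f.fileno(), len(blob)) as mm:
        mm.write(blob)

# Reader (e.g. your development script): map the same file and unpickle
# straight from the mapped buffer.
with open("dataset.cache", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data = pickle.loads(mm)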

A Jupyter notebook allows you to load a large data set into a memory-resident data structure, such as a Pandas DataFrame, in one cell. Then you can operate on that data structure in subsequent cells without having to reload the data.
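
For example (a sketch; the file name and column are placeholders):

# cell 1 -- run once per kernel session; the DataFrame stays in memory
import pandas as pd
df = pd.read_csv("big_dataset.csv")

# cell 2 -- re-run freely while you iterate; the data is never reloaded
summary = df.groupby("category").size()
print(summary)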
