
Store a lot of data inside Python

Maybe I should start with a small introduction to my problem. I'm writing a Python program which will be used for post-processing of different physical simulations. Every simulation can create up to 100 GB of output. I deal with different information (like positions, fields and densities, ...) for different time steps. I would like to have access to all this data at once, which isn't possible because I don't have enough memory on my system. Normally I read a file, do some operations and clear the memory, then read other data, do some operations and clear the memory again.

Now my problem: if I do it this way, I spend a lot of time reading the same data more than once, and that takes a lot of time. I would like to read it only once and store it for easy access. Do you know a method to store a lot of data which is really fast, or which doesn't need a lot of space?

I just created a method which is around ten times faster than a normal open-read, but I use cat (the Linux command) for that. It's a really dirty method and I would like to kick it out of my script.

Is it possible to use databases to store this data and to get the data faster than normal reading? (Sorry for this question, but I'm not a computer scientist and I don't know much about databases.)

EDIT:

My cat code looks something like this (only an example):

import os
import numpy as np

# read the whole file via the external `cat` command and split it into tokens
out = os.popen("cat " + base + "phs/phs01_00023_" + time).read().split()
# and if I want to have this data as an array, then I normally convert and
# reshape it (if I need to)
out = np.array(out, dtype=float)
out = out.reshape(shape)  # 'shape' is whatever grid shape I need

Normally I would use the numpy method numpy.loadtxt, which needs the same time as normal reading:

f = open('filename')
data = f.read()
...

I think that loadtxt just uses the normal methods with some additional lines of code.
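For reference, a minimal numpy.loadtxt call for the same kind of file as in the cat snippet above (the path is just an example; the dtype and any reshaping would depend on the actual file layout):

import numpy as np

# parses the whole text file in one call; for files of this size it is
# roughly as slow as open().read() plus manual splitting
out = np.loadtxt(base + "phs/phs01_00023_" + time)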

I know there are some better ways to read out data, but everything I found was really slow. I will now try mmap and hopefully I will get better performance.

I would try using HDF5. There are two commonly used Python interfaces, h5py and PyTables. While the latter seems to be more widespread, I prefer the former.
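A minimal sketch of what that could look like with h5py (the file name, dataset path and array shape below are made up for illustration):

import numpy as np
import h5py

# write one quantity for one time step into a chunked, compressed dataset
with h5py.File("simulation.h5", "w") as f:
    f.create_dataset("positions/t0000",
                     data=np.random.rand(1_000_000, 3),
                     compression="gzip")

# later: open the file again and read only the slice you actually need
with h5py.File("simulation.h5", "r") as f:
    first_chunk = f["positions/t0000"][:1000]  # only this part is loaded into RAM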

If you're on a 64-bit operating system, you can use the mmap module to map that entire file into memory space. Then, reading random bits of the data can be done a lot more quickly since the OS is then responsible for managing your access patterns. Note that you don't actually need 100 GB of RAM for this to work, since the OS will manage it all in virtual memory.

I've done this with a 30 GB file (the Wikipedia XML article dump) on 64-bit FreeBSD 8 with very good results.
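A small sketch of that approach, assuming a plain binary or text output file (the path below is hypothetical):

import mmap

with open("phs/phs01_00023_0001", "rb") as f:
    # map the whole file into the address space; nothing is read yet
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    header = mm[:128]           # random access: the OS pages in only what you touch
    first_line = mm.readline()  # mmap objects also support file-like reads

    mm.close()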

If you're working with large datasets, Python may not be your best bet. If you want to use a database like MySQL or Postgres, you should give SQLAlchemy a try. It makes it quite easy to work with potentially large datasets using small Python objects. For example, if you use a definition like this:

from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, LargeBinary, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import deferred

SqlaBaseClass = declarative_base()

class MyDataObject(SqlaBaseClass):
  __tablename__ = 'datarows'
  eltid         = Column(Integer, primary_key=True)
  name          = Column(String(50), nullable=False, unique=True, index=True)
  created       = Column(DateTime)
  updated       = Column(DateTime, default=datetime.today)

  mylargecontent = deferred(Column(LargeBinary))

  def __init__(self, name):
      self.name    = name
      self.created = datetime.today()

  def __repr__(self):
      return "<MyDataObject name='%s'>" %(self.name,)

Then you can easily access all rows using small data objects:

# set up database connection; open dbsession; ... 

for elt in dbsession.query(MyDataObject).all():
    print(elt.eltid)  # does not access mylargecontent

    if something(elt):
        process(elt.mylargecontent)  # now the large binary is pulled from the
                                     # db server on demand

I guess the point is: you can add as many fields to your data as you want, adding indexes as needed to speed up your search. And, most importantly, when you work with a MyDataObject, you can mark potentially large fields as deferred so that they are loaded only when you need them.
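One possible way to set up the dbsession used above, just as a sketch (the SQLite URL and the row name are placeholders; a MySQL or Postgres URL would work the same way):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite:///simulation.db")  # placeholder database URL
SqlaBaseClass.metadata.create_all(engine)          # creates the 'datarows' table

Session = sessionmaker(bind=engine)
dbsession = Session()

dbsession.add(MyDataObject("run_0001"))            # hypothetical row name
dbsession.commit()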
