
How do you create a compressed dataset in PyTables that can store a Unicode string?

I'm using PyTables to store a data array, which works fine; along with it I need to store a moderately large (50K-100K) Unicode string containing JSON data, and I'd like to compress it.

How can I do this in PyTables? It's been a long time since I've worked with HDF5, and I can't remember the right way to store character arrays so they can be compressed. (And I can't seem to find a similar example of doing this on the PyTables website.)

PyTables does not natively support Unicode yet. To store Unicode, first convert the string to bytes and then store a VLArray of length-1 strings or uint8. To get compression, simply instantiate your array with a Filters instance that has a non-zero complevel.
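A minimal sketch of that approach, assuming PyTables 3.x naming (`open_file`, `create_vlarray`); the file name and node name here are made up for illustration:

```python
import numpy as np
import tables as tb

s = u'{"artist": "Mr. Tambourine Man \u266b"}'  # a Unicode JSON string

with tb.open_file('compressed_json.h5', 'w') as f:
    # Any non-zero complevel turns compression on
    filters = tb.Filters(complevel=5, complib='zlib')
    # A VLArray of uint8: each append() stores one variable-length row of bytes
    vla = f.create_vlarray(f.root, 'json_bytes', tb.UInt8Atom(),
                           filters=filters)
    vla.append(np.frombuffer(s.encode('utf-8'), dtype=np.uint8))

with tb.open_file('compressed_json.h5', 'r') as f:
    raw = f.root.json_bytes[0]                # numpy uint8 array
    restored = raw.tobytes().decode('utf-8')  # back to a Unicode string

print(restored == s)  # True if the round trip preserved the string
```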

All of the examples I know of storing JSON data like this do so using the HDF5 C-API.

OK, based on Anthony Scopatz's approach, I have a working solution.

import numpy as np
import tables as pt

def recordStringInHDF5(h5file, group, nodename, s, complevel=5, complib='zlib'):
    '''Creates a compressed CArray node in an HDF5 file
    that stores a Unicode string as UTF-8 bytes.'''
    byte_arr = np.frombuffer(s.encode('utf-8'), np.uint8)
    atom = pt.UInt8Atom()
    filters = pt.Filters(complevel=complevel, complib=complib)
    ca = h5file.create_carray(group, nodename, atom, shape=(len(byte_arr),),
                              filters=filters)
    ca[:] = byte_arr
    return ca

def retrieveStringFromHDF5(node):
    '''Reads the byte array back and decodes it to a Unicode string.'''
    # Python 2; on Python 3 use node.read().tobytes().decode('utf-8')
    return unicode(node.read().tostring(), 'utf-8')

If I run this:

>>> h5file = pt.openFile("test1.h5",'w')
>>> recordStringInHDF5(h5file, h5file.root, 'mrtamb',
    u'\u266b Hey Mr. Tambourine Man \u266b')

/mrtamb (CArray(30,), shuffle, zlib(5)) ''
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (65536,)

>>> h5file.flush()
>>> h5file.close()
>>> h5file = pt.openFile("test1.h5")
>>> print retrieveStringFromHDF5(h5file.root.mrtamb)

♫ Hey Mr. Tambourine Man ♫

I've been able to run this with strings in the 300 kB range and gotten good compression ratios.
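The byte-level conversion at the heart of this can be checked without touching HDF5 at all; this sketch just round-trips a JSON document through the same uint8 representation the CArray stores:

```python
import json
import numpy as np

doc = {u'title': u'Mr. Tambourine Man', u'notes': u'\u266b \u266b'}

# Serialize to a Unicode JSON string, then to the uint8 array stored on disk
s = json.dumps(doc, ensure_ascii=False)
stored = np.frombuffer(s.encode('utf-8'), dtype=np.uint8)

# Reverse the transformation: uint8 -> bytes -> Unicode -> Python object
restored = json.loads(stored.tobytes().decode('utf-8'))

print(restored == doc)  # True
```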
