
Uniformly shuffle 5 gigabytes of numpy data

I'm training a neural network with about five gigabytes of data stored as numpy arrays. The data are split into chunks of 100000 rows, and I've done six cycles of training over all the chunks in a random order. Unfortunately, the network has begun to overfit. I think it still has capacity to fit the data more closely; my suspicion is that internal regularities within each chunk are starting to contradict one another, and I need to shuffle the data more thoroughly so that it can train on different combinations. I want to try this before going to the trouble of getting more training data.

Does anyone know a good way to generate a new permutation of 3.6 million (very long) rows of numpy data? I thought about using one of these techniques, but writing these arrays using numpy.savetxt produces unbelievably huge files, and I can't tell how to manipulate individual rows from a standard npy file in a way that helps to solve this problem.

Right now, my best idea is to create a permutation of paired indices (c, r) into the data, where c chooses a chunk and r chooses a row from that chunk. I could store each row in a new preallocated array, and then save it. But I wonder if there's a less horribly I/O-bound solution. Is there some principled way to shuffle random pairs of chunks together until you get a permutation that's statistically independent from the starting permutation?
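
For concreteness, the paired-index idea might look roughly like the sketch below; "chunks" is just an illustrative name for the list of loaded (or memmapped) chunk arrays, and the final gather is the I/O-bound step in question:

import random

# `chunks` is assumed to be a list of per-chunk arrays or memmaps;
# the name is illustrative, not taken from any real code.
def paired_index_permutation(chunks):
    # One (chunk, row) pair per row across the whole dataset.
    pairs = [(c, r) for c, chunk in enumerate(chunks)
                    for r in range(chunk.shape[0])]
    random.shuffle(pairs)
    return pairs

# Gathering rows in this order is the I/O-bound part:
# shuffled = numpy.array([chunks[c][r] for c, r in pairs])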

Among the things I've tried so far, a PyTables solution is currently the best, followed by a solution that uses numpy's support for memmapped arrays. The PyTables solution is not straightforward though. If you use a shuffled array of integers to directly index a PyTables array, it's very slow. Much faster is the following two-step process:

  1. Select a random subset of the array using a boolean index array. This must be done in a chunkwise fashion. If you pass the index array directly to the PyTables array, it's slow.
    • Preallocate a numpy array and create a list of slices that split the PyTables array into chunks.
    • Read each chunk entirely into memory, and then use the corresponding chunk of the index array to select the correct values for that chunk.
    • Store the selected values in the preallocated array.
  2. Then shuffle the preallocated array.

This process produces a permutation as random as a normal shuffling process would. If that doesn't seem obvious, consider this: (n choose x) * x! = x! * n! / (x! * (n - x)!) = n! / (n - x)!, which is exactly the number of ordered selections of x rows out of n, so picking a uniform random subset and then shuffling it gives every ordered selection the same probability. This method is fast enough to do a shuffle-on-load for every training cycle. It's also able to compress the data down to ~650M -- nearly a 90% deflation.

Here's my current implementation; this is called once for every training chunk in the corpus. (The returned arrays are shuffled elsewhere.)

def _h5_fast_bool_ix(self, h5_array, ix, read_chunksize=100000):
    '''Iterate over an h5 array chunkwise to select a random subset
    of the array. `h5_array` should be the array itself; `ix` should
    be a boolean index array with as many values as `h5_array` has
    rows; and you can optionally set the number of rows to read per
    chunk with `read_chunksize` (default is 100000). For some reason
    this is much faster than using `ix` to index the array directly.'''

    # Cap each slice at the number of rows so a trailing partial chunk
    # isn't dropped (and so this works under Python 3, where / is float
    # division).
    n_rows = h5_array.shape[0]
    slices = [slice(start, min(start + read_chunksize, n_rows))
              for start in range(0, n_rows, read_chunksize)]

    a = numpy.empty((ix.sum(), h5_array.shape[1]), dtype=float)
    a_start = 0
    for sl in slices:
        chunk = h5_array[sl][ix[sl]]
        a_end = a_start + chunk.shape[0]
        a[a_start:a_end] = chunk
        a_start = a_end

    return a
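
For reference, a caller might wire the two steps together roughly as follows. The names loader, h5_array, and subset_size, and the seeded RandomState, are assumptions for illustration, not part of the original code:

import numpy

rng = numpy.random.RandomState(0)           # any seeded RNG will do
n_rows = h5_array.shape[0]

# Step 1: boolean mask marking a random subset of rows.
ix = numpy.zeros(n_rows, dtype=bool)
ix[rng.choice(n_rows, size=subset_size, replace=False)] = True

subset = loader._h5_fast_bool_ix(h5_array, ix)   # chunkwise boolean selection
rng.shuffle(subset)                              # step 2: shuffle rows in place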

It's somewhat crazy to me that an O(n^2) approach (iterating over the entire PyTables array for every chunk) is faster in this case than an O(n) approach (randomly selecting each row in one pass). But hey, it works. With a bit more indirection, this could be adapted for loading arbitrary non-random permutations, but that adds more complexity than it's worth here.

The mmap solution is here for reference, for those people who need a pure numpy solution for whatever reason. It shuffles all the data in about 25 minutes, while the above solution manages the same in less than half that time. This should scale linearly too, because mmap allows (relatively) efficient random access.

import numpy
import os
import random

X = []
Y = []

# Sort the filenames so the input and output chunks pair up by index.
for filename in sorted(os.listdir('input')):
    X.append(numpy.load(os.path.join('input', filename), mmap_mode='r'))

for filename in sorted(os.listdir('output')):
    Y.append(numpy.load(os.path.join('output', filename), mmap_mode='r'))

indices = [(chunk, row) for chunk, rows in enumerate(X) 
                        for row in range(rows.shape[0])]
random.shuffle(indices)

newchunks = 50
newchunksize = len(indices) // newchunks

for i in range(0, len(indices), newchunksize):
    print(i)
    rows = [X[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('X_shuffled_' + str(i), numpy.array(rows))
    rows = [Y[chunk][row] for chunk, row in indices[i:i + newchunksize]]
    numpy.save('Y_shuffled_' + str(i), numpy.array(rows))

The following assumes your data is already divided into easily-retrievable records of some sort. (I don't know if there's a standard file format for numpy data.)

  1. Create an index of the data in the form of a dict, mapping each unique record ID (0 through n - 1) to some means of finding the data again. For instance, if it's all in one binary file, you'd store a tuple of the form (file_offset, record_length). No need to hold onto the data itself.

  2. Create a list of n elements, containing the keys of the index dict (again, 0 through n - 1).

  3. Shuffle the list of record IDs. (Provide your own random number generator, if needed.)

  4. Open a new file (or whatever) to contain the shuffled data.

  5. Read record IDs out of the list from beginning to end. For each record ID, look up that record's location in the index. Grab the data at that location and append it to the output file.

Pseudo-code:

# This assumes a binary file of unequal-length
# records.  It also assumes that the file won't
# be changed while we're doing this.

# Create index.
index = {}
rec_offset = 0
for rec_id, record in original_data.iterate_records():
    # This bit depends greatly on how your data
    # is stored...
    rec_length = len(record)
    index[rec_id] = (rec_offset, rec_length)
    rec_offset += rec_length

# Shuffle.
num_records_indexed = rec_id + 1  # rec_id is still in scope.
records_order = list(range(num_records_indexed))
# random.shuffle works in place and returns None; substitute your own
# RNG's .shuffle (e.g. random.Random(seed).shuffle) if you need one.
random.shuffle(records_order)

# Create new shuffled-data file.
with open("output_file.bin", "wb") as output:
    for rec_id in records_order:
        rec_offset, rec_length = index[rec_id]
        record = original_data.get_rec_at(rec_offset, rec_length)
        output.write(record)

Indexing, shuffling, and de-indexing are all O(n), so the worst part should be I/O: reading the data and then copying it (a second read, plus a write).
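
If the records happen to be fixed-length numpy rows, the same index/shuffle/copy idea needs no explicit index dict, since each record's offset is just row_id * row_bytes. A minimal sketch under that assumption (the file names, row count, and column count below are made up for illustration):

import random

import numpy

n_rows, n_cols = 3600000, 100                    # assumed shape of the data
row_bytes = n_cols * numpy.dtype(numpy.float64).itemsize

order = list(range(n_rows))
random.shuffle(order)

with open('data.bin', 'rb') as src, open('data_shuffled.bin', 'wb') as dst:
    for row_id in order:
        src.seek(row_id * row_bytes)             # jump straight to the record
        dst.write(src.read(row_bytes))           # copy it to its new position

As with the pseudo-code above, the write side is sequential; the random seeks on the read side are what make this I/O-bound.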
