How do I build (or precompute) a histogram from a file too large to fit in memory?

Question

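Is there a graphing library for Python that can draw a histogram without requiring all the raw data points to be stored as a numpy array or list?

My dataset is too large to hold in memory, and I don't want to use subsampling to reduce its size.

What I'm looking for is a library that can take the output of a generator (each data point yielded from a file as a float) and build a histogram on the fly.

This includes computing the bin size as the generator yields each data point from the file.

If no such library exists, I'd like to know whether numpy can precompute a counter of the form {bin_1:count_1, bin_2:count_2...bin_x:count_x} from the yielded data points.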

The data points are stored as a vertical matrix in a tab-delimited file, arranged as node-node-score, like this:

node   node   5.55555

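More information:

There are 104301133 lines in the data (so far)
I don't know the minimum or maximum values
The bin widths should all be the same
The number of bins could be 1000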

Attempted answer:

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

low = np.inf
high = -np.inf

# first pass: find the overall min/max and count the rows
chunksize = 1000
lines = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    lines += len(chunk)  # exact row count; loop*chunksize overcounts a short last chunk

nbins = math.ceil(math.sqrt(lines))  # square-root rule for the number of bins

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # int64 counts (changed from uint32)

# second pass: iterate over the dataset in chunks of `chunksize` lines (increase or
# decrease this according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)

    # accumulate bin counts over chunks
    total += subtotal

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')

Output: [histogram image saved as gsl_test_hist.svg, not reproduced here]

Answer 1

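You can iterate over chunks of your dataset and use np.histogram to accumulate the bin counts into a single vector (you need to define the bin edges beforehand and pass them to np.histogram via the bins= parameter), for example: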

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# low, high and nbins must be defined beforehand
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.uint)

# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)

    # accumulate bin counts over chunks
    total += subtotal.astype(np.uint)
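
The question asks about a generator of floats rather than pandas chunks. The same accumulation can be sketched for that case as below. This is only an illustration: the function name histogram_from_generator, the batch_size parameter and the demo values are assumptions made here, not part of the original answer.

```python
from itertools import islice

import numpy as np


def histogram_from_generator(values, bin_edges, batch_size=100_000):
    """Accumulate bin counts from an iterable of floats, one batch at a time.

    Hypothetical helper: the name and batch_size are illustrative choices.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    total = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    iterator = iter(values)
    while True:
        # Pull at most batch_size values so only one batch is held in memory.
        batch = np.fromiter(islice(iterator, batch_size), dtype=float)
        if batch.size == 0:
            break
        counts, _ = np.histogram(batch, bins=bin_edges)
        total += counts
    return total


# Usage: a stand-in generator; a real one would yield scores parsed from the file.
demo_values = (i / 10 for i in range(100))  # 0.0, 0.1, ..., 9.9
demo_edges = np.linspace(0.0, 10.0, 6)      # 5 equal-width bins
demo_counts = histogram_from_generator(demo_values, demo_edges, batch_size=7)
```

Values that fall outside the range covered by bin_edges are silently dropped by np.histogram, so the edges need to cover the full range of the data.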

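If you want to make sure the bins span the full range of values in your data but you don't know the minimum and maximum, you will need to loop over the data once beforehand to compute them (e.g. using np.min / np.max), for example: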

low = np.inf
high = -np.inf

# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
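
Putting the two passes together, a minimal sketch without pandas might look like the following. It assumes a tab-delimited file with the score in the third column, as described in the question. The names read_scores and two_pass_histogram are hypothetical, and the sketch assumes the file contains at least one data row.

```python
import math

import numpy as np


def read_scores(path, batch_size=100_000):
    """Yield the third tab-separated column as float arrays of at most batch_size."""
    batch = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            batch.append(float(line.rstrip("\n").split("\t")[2]))
            if len(batch) >= batch_size:
                yield np.array(batch)
                batch = []
    if batch:
        yield np.array(batch)


def two_pass_histogram(path, nbins, batch_size=100_000):
    """Return (bin_edges, counts) using one min/max pass and one counting pass."""
    # Pass 1: find the global minimum and maximum.
    low, high = math.inf, -math.inf
    for scores in read_scores(path, batch_size):
        low = min(low, scores.min())
        high = max(high, scores.max())

    # Pass 2: count into equal-width bins that span [low, high].
    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, dtype=np.int64)
    for scores in read_scores(path, batch_size):
        counts, _ = np.histogram(scores, bins=bin_edges)
        total += counts
    return bin_edges, total
```

For the 1000 equal-width bins mentioned in the question, this would be called as two_pass_histogram('/path/to/my/dataset.txt', nbins=1000). The file is read twice, which is the price of not knowing the range in advance.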

Once you have the array of bin counts, you can generate a bar plot directly with plt.bar:

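# use the actual bin widths and left-align each bar on its bin's lower edge
plt.bar(bin_edges[:-1], total, width=np.diff(bin_edges), align='edge')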

You can also use the weights= parameter of plt.hist to generate a histogram from the vector of counts rather than from the samples, for example:

plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)
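
Since the counts are expensive to compute for a file this large, it may be worth storing them so the plot can be redrawn without rereading the data. A minimal sketch using np.savez and np.load follows. The helper names save_histogram and load_histogram and the archive keys are assumptions chosen for illustration.

```python
import numpy as np


def save_histogram(path, bin_edges, total):
    """Store the bin edges and counts in one .npz archive (hypothetical helper)."""
    np.savez(path, bin_edges=np.asarray(bin_edges), total=np.asarray(total))


def load_histogram(path):
    """Return (bin_edges, total) from an archive written by save_histogram."""
    with np.load(path) as archive:
        return archive["bin_edges"], archive["total"]
```

The loaded arrays can then be passed to plt.bar or plt.hist exactly as above.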

Answer 1
6 accepted 2016-05-06 23:08:35