How to build (or precompute) a histogram from a file too large for memory?
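Is there a graphing library for Python that can plot a histogram without storing all the raw data points as a numpy array or list?
My dataset is too large to fit in memory, and I don't want to subsample it to reduce its size.
What I'm looking for is a library that can take the output of a generator (each data point is yielded from a file as a float) and build the histogram on the fly.
This includes computing the bin size as the generator yields each data point from the file.
If no such library exists, I'd like to know whether numpy can precompute a counter of {bin_1:count_1, bin_2:count_2...bin_x:count_x} from the yielded data points.
The data points are stored as a vertical matrix in a tab-separated file, arranged as node-node-score, like this: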
node node 5.55555
More information:
Attempted answer:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

low = np.inf
high = -np.inf
chunksize = 1000
lines = 0
# first pass: find the overall min/max and count the rows
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    lines += len(chunk)  # exact row count; loop * chunksize overcounts the last chunk
nbins = math.ceil(math.sqrt(lines))
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # int64 matches the counts returned by np.histogram
# second pass: iterate over the dataset in chunks (increase or decrease chunksize
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)
    # accumulate bin counts over chunks
    total += subtotal
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')
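You can iterate over chunks of your dataset and use np.histogram to accumulate the bin counts into a single vector. You need to define the bin edges beforehand and pass them to np.histogram through the bins= parameter, for example: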
import numpy as np
import pandas as pd

# low, high and nbins must already be defined at this point
bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.uint)
# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)
    # accumulate bin counts over chunks
    total += subtotal.astype(np.uint)
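To check that chunked accumulation gives the same counts as a single call, here is a minimal sketch. The data is made up and lives in an in-memory string instead of a real file, and the bin edges are an arbitrary fixed choice; none of these values come from the original post.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the big file: 10,000 tab-separated node-node-score rows.
rng = np.random.default_rng(0)
text = "".join(f"a\tb\t{s!r}\n" for s in rng.normal(size=10_000).tolist())

# 16 fixed bins chosen up front (an assumption for this sketch).
bin_edges = np.linspace(-4.0, 4.0, 17)
total = np.zeros(len(bin_edges) - 1, dtype=np.int64)

# Accumulate per-chunk counts; only 1000 rows are held in memory at a time.
for chunk in pd.read_table(io.StringIO(text), header=None, chunksize=1000):
    subtotal, _ = np.histogram(chunk.iloc[:, 2], bins=bin_edges)
    total += subtotal

# Reference: the same histogram computed with every row loaded at once.
full = pd.read_table(io.StringIO(text), header=None)
reference, _ = np.histogram(full.iloc[:, 2], bins=bin_edges)
print(np.array_equal(total, reference))
```

Because every chunk is binned against the same edges, the per-chunk counts simply add up. Values outside the chosen range are dropped by np.histogram in both cases.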
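If you want the bins to span the full range of values but you don't know the minimum and maximum in advance, you need one extra pass over the data beforehand to compute them (for example with np.min / np.max):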
low = np.inf
high = -np.inf
# find the overall min/max
for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
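For the generator-based workflow the question describes, the same two-pass idea can be sketched with the standard library only, producing a {bin_index: count} counter. The names read_scores and streaming_histogram are hypothetical helpers written for this illustration, not part of numpy or pandas, and the sample rows are made up.

```python
import math
from collections import Counter


def read_scores(lines):
    """Yield the third tab-separated field of each line as a float."""
    for line in lines:
        yield float(line.rstrip("\n").split("\t")[2])


def streaming_histogram(make_scores, nbins):
    """Two passes over a re-creatable generator; returns (low, width, counts)."""
    # Pass 1: overall min/max, one value in memory at a time.
    low, high = math.inf, -math.inf
    for x in make_scores():
        low = min(low, x)
        high = max(high, x)
    width = (high - low) / nbins
    # Pass 2: assign each value to an equal-width bin.
    counts = Counter()
    for x in make_scores():
        if width == 0:
            i = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum sits on the last edge; clamp it into the last bin.
            i = min(int((x - low) / width), nbins - 1)
        counts[i] += 1
    return low, width, counts


# Made-up rows in node-node-score layout; a real run would pass an open file.
rows = ["a\tb\t0.0", "a\tc\t1.0", "b\tc\t2.0", "b\td\t4.0"]
low, width, counts = streaming_histogram(lambda: read_scores(rows), nbins=4)
print(low, width, dict(counts))
```

The generator has to be re-creatable (here via a lambda) because it is consumed twice. Floating-point rounding in the index computation may place a value that lies exactly on an edge in a neighbouring bin compared with np.histogram.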
Once you have the array of bin counts, you can generate the bar plot directly with plt.bar:
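# use the actual bin widths and anchor each bar at its left bin edge
plt.bar(bin_edges[:-1], total, width=np.diff(bin_edges), align='edge')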
You can also use the weights= parameter of plt.hist to generate the histogram from a vector of counts rather than from samples, for example:
plt.hist(bin_edges[:-1], bins=bin_edges, weights=total, ...)
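The weights= approach can be tried in isolation with made-up numbers. In this sketch the edges and counts are invented for illustration, and the Agg backend and in-memory buffer are assumptions so that it runs without a display or an output file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

bin_edges = np.linspace(0.0, 10.0, 6)   # made-up edges: 5 bins of width 2
total = np.array([3, 7, 12, 5, 1])      # made-up accumulated counts

fig, ax = plt.subplots()
# One "sample" per bin (its left edge), weighted by that bin's count.
n, edges, _ = ax.hist(bin_edges[:-1], bins=bin_edges, weights=total)
ax.set_xlabel("score")
ax.set_ylabel("count")

buf = io.BytesIO()
fig.savefig(buf, format="svg")  # swap buf for a filename to write to disk
```

Each left edge falls inside its own bin, so the heights returned in n match the precomputed counts.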