Fitting Distributions in Scipy Based on Frequency Data Efficiently

I have some data that I want to fit to a distribution. The data is given by frequency: for every event I have observed, I have the number of times I observed it. So something like:

data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]

where the first number in each tuple is the event I have observed, and the second number is the total number of observations for that event.

With Scipy, I can fit (for example) a lognormal distribution using a call to scipy.stats.lognorm.fit. However, this routine expects to see a list of all of the observations, not the frequencies. I can fit the distribution like this:

import scipy.stats

# Expand the (event, count) pairs into one entry per observation.
temp_data = []
for event, count in data:
    temp_data += [event] * count
params = scipy.stats.lognorm.fit(temp_data)

but wow, that seems horribly inefficient.

Is there a way to fit a distribution, in Scipy or another similar tool, based upon the frequencies? If not, is there a better way to fit the distribution without having to create a potentially giant list of values?

Unfortunately, looking at the source, it seems like the 'materialized' aspect of the data is hardcoded. The function's not that complicated, though, so you could make your own version. TBH, if your total N is still manageable, I'd probably just do data = np.array(data); expanded_data = np.repeat(data[:,0], data[:,1]) despite the inefficiency, because life is short.
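As a rough sketch of what such a "make your own version" could look like, the negative log-likelihood can be weighted by the counts and minimised directly, so the data never has to be expanded. This is not scipy's own fit routine: the helper name weighted_nll, the choice of Nelder-Mead, and fixing loc at 0 are all assumptions made for illustration.

```python
import numpy as np
from scipy import optimize, stats

data = np.array(
    [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)],
    dtype=float,
)
values, counts = data[:, 0], data[:, 1]


def weighted_nll(theta, dist, x, w):
    """Negative log-likelihood where each x[i] counts w[i] times.

    theta holds log(shape) and log(scale) so both stay positive
    during the search; loc is fixed at 0 (an assumption).
    """
    shape, scale = np.exp(theta)
    return -np.sum(w * dist.logpdf(x, shape, loc=0, scale=scale))


# Start from shape = 1, scale = 1 and minimise the weighted NLL.
result = optimize.minimize(
    weighted_nll,
    x0=np.zeros(2),
    args=(stats.lognorm, values, counts),
    method="Nelder-Mead",
)
shape_hat, scale_hat = np.exp(result.x)
print("shape:", shape_hat, "scale:", scale_hat)
```

The same function should work for other scipy.stats continuous distributions that take a single shape parameter; distributions with a different number of shape parameters would need the unpacking of theta adjusted.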

Another alternative would be to use pomegranate, which supports passing weights:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pomegranate as pg

data = [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)]

data = np.array(data)
expanded = np.repeat(data[:,0], data[:,1].astype(int))

scipy_shape, _, scipy_scale = scipy_params = scipy.stats.lognorm.fit(expanded, floc=0)
scipy_sigma, scipy_mu = scipy_shape, np.log(scipy_scale)

pg_dist = pg.LogNormalDistribution(0, 1)
pg_dist.fit(data[:,0], weights=data[:,1])
pg_mu, pg_sigma = pg_dist.parameters

fig = plt.figure()
ax = fig.add_subplot(111)

x = np.linspace(0.1, 10, 100)
ax.plot(data[:,0], data[:, 1] / data[:,1].sum(), label="freq")
ax.plot(x, scipy.stats.lognorm(*scipy_params).pdf(x),
        label=r"scipy: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(scipy_mu, scipy_sigma), alpha=0.5)
ax.plot(x, pg_dist.probability(x),
        label=r"pomegranate: $\mu$ {:1.3f} $\sigma$ {:1.3f}".format(pg_mu, pg_sigma), linestyle='--', alpha=0.5)
ax.legend(loc='upper right')
fig.savefig("compare.png")

gives me

[Figure: comparison of the scipy and pomegranate fits]
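For the lognormal case specifically, with the location fixed at 0, the maximum-likelihood estimates have a closed form: mu is the mean of log(x) and sigma is its standard deviation, and both can be weighted by the counts. A minimal sketch using only numpy (the variable names are illustrative):

```python
import numpy as np

data = np.array(
    [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)],
    dtype=float,
)
values, counts = data[:, 0], data[:, 1]

# Weighted mean and (population) standard deviation of log(x).
log_values = np.log(values)
mu_hat = np.average(log_values, weights=counts)
sigma_hat = np.sqrt(np.average((log_values - mu_hat) ** 2, weights=counts))

# The same fit in scipy.stats.lognorm's parameterisation:
# shape = sigma, loc = 0, scale = exp(mu).
shape, loc, scale = sigma_hat, 0.0, np.exp(mu_hat)
print("mu:", mu_hat, "sigma:", sigma_hat)
```

This gives the same mu and sigma as taking the mean and standard deviation of the logs of the expanded data, without ever building the expanded array.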

You can draw a random sample according to your frequency distribution, and fit that:

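import scipy.stats
import numpy as np

data = np.array(
    [(1, 34), (2, 1023), (3, 3243), (4, 879), (5, 202), (6, 10)],
    dtype=float,
)
values = data[:, 0]   # observed events (first column)
weights = data[:, 1]  # observation counts (second column)
seed = 87

# Draw a sample whose composition follows the observed frequencies.
gen = np.random.default_rng(seed)
sample = gen.choice(
    values, size=500, p=weights/np.sum(weights))

# Fit the sample, not the six distinct values.
params = scipy.stats.lognorm.fit(sample)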
