How to properly fit data to a power law in Python?
I am counting the occurrences of unique words in the novel Moby Dick and using the powerlaw Python package to fit the word frequencies to a power law.
I am not sure why I can't reproduce the results from previous work by Clauset et al., as both the p-value and the KS score are "bad".
The idea is to fit the frequencies of unique words to a power law. However, the Kolmogorov-Smirnov goodness-of-fit tests calculated by scipy.stats.kstest look terrible.
I have the following function to fit data to a power law:
import numpy as np
import powerlaw
import scipy
from scipy import stats

def fit_x(x):
    fit = powerlaw.Fit(x, discrete=True)
    alpha = fit.power_law.alpha
    xmin = fit.power_law.xmin
    print('powerlaw', scipy.stats.kstest(x, "powerlaw", args=(alpha, xmin), N=len(x)))
    print('lognorm', scipy.stats.kstest(x, "lognorm", args=(np.mean(x), np.std(x)), N=len(x)))
Downloading the frequencies of unique words in the novel Moby Dick by Herman Melville (which are supposed to follow a power law according to Aaron Clauset et al.):
wget http://tuvalu.santafe.edu/~aaronc/powerlaws/data/words.txt
Python script:
x = np.loadtxt('./words.txt')
fit_x(x)
Results:
('powerlaw', KstestResult(statistic=0.862264651286131, pvalue=0.0))
('log norm', KstestResult(statistic=0.9910368602492707, pvalue=0.0))
When I follow this R tutorial on the same Moby Dick dataset, I get the expected results, with a decent p-value and KS test value:
library("poweRlaw")
data("moby", package="poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
bs_p = bootstrap_p(m_pl)
bs_p$p
## [1] 0.6738
What am I missing when computing the KS test values and post-processing the fit from the powerlaw Python library? The PDF and CDF look fine to me, but the KS tests look wrong.
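A note on the scipy call in the question: scipy.stats.kstest(x, "powerlaw", ...) looks up scipy.stats.powerlaw, which is a different distribution from the one the powerlaw package fits. It has density a * x**(a - 1) on the interval [loc, loc + scale], so args=(alpha, xmin) is read as shape a = alpha and loc = xmin. The check below is a minimal sketch of that mismatch. The values 1.95 and 7 and the example counts are illustrative stand-ins, not output from the question's fit.

```python
import numpy as np
from scipy import stats

alpha, xmin = 1.95, 7  # illustrative stand-ins for a fitted alpha and xmin

# scipy's "powerlaw" has density a * x**(a - 1) on [0, 1] by default.
density = stats.powerlaw.pdf(0.5, 2.0)  # 2 * 0.5**(2 - 1)
print(density)

# With args=(alpha, xmin), xmin is taken as loc, so the support becomes [7, 8].
counts = np.array([1, 6, 7, 8, 100, 14086])  # made-up word counts
cdf_values = stats.powerlaw.cdf(counts, alpha, xmin)
print(cdf_values)
```

Every count is mapped to either 0 or 1, so the KS statistic is bound to be large regardless of how well a discrete power law describes the tail. The lognorm call appears to have the same kind of problem: np.mean(x) and np.std(x) are read as the shape s and loc, not as a mean and standard deviation.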
I think you should pay attention to whether the data is continuous or discrete, and then choose the appropriate test; in addition, as the previous poster said, the size of the dataset will have some impact on the result. I hope this helps.
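To make the discrete case concrete, here is a rough sketch of a KS distance computed against a discrete power-law CDF on the tail x >= xmin only, followed by a simplified bootstrap p-value. All function names (model_cdf, ks_distance, fit_alpha, sample_tail, bootstrap_p_value) are hypothetical helpers written for this illustration and are not part of the powerlaw package. It also assumes xmin is held fixed and only the tail is resampled, which is simpler than the full procedure of Clauset et al. that re-estimates xmin for every synthetic dataset, so the numbers need not match poweRlaw's bootstrap_p. The rejection sampler can be slow when xmin is large.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function


def model_cdf(x, alpha, xmin):
    """P(X <= x) for a discrete power law supported on integers x >= xmin."""
    return 1.0 - zeta(alpha, np.asarray(x, dtype=float) + 1.0) / zeta(alpha, xmin)


def ks_distance(tail, alpha, xmin):
    """Largest gap between the empirical and model CDFs on the tail."""
    values, counts = np.unique(tail, return_counts=True)
    ecdf_at = np.cumsum(counts) / tail.size
    ecdf_before = ecdf_at - counts / tail.size  # empirical CDF just below each value
    gap_at = np.abs(ecdf_at - model_cdf(values, alpha, xmin))
    gap_before = np.abs(ecdf_before - model_cdf(values - 1, alpha, xmin))
    return float(max(gap_at.max(), gap_before.max()))


def fit_alpha(tail, xmin):
    """Maximum-likelihood alpha with xmin held fixed (an assumption here)."""
    log_sum = np.log(tail).sum()

    def neg_log_likelihood(alpha):
        return tail.size * np.log(zeta(alpha, xmin)) + alpha * log_sum

    return minimize_scalar(neg_log_likelihood, bounds=(1.01, 6.0), method="bounded").x


def sample_tail(alpha, xmin, size, rng):
    """Draw from the discrete power law by rejecting Zipf draws below xmin."""
    out = np.empty(0)
    while out.size < size:
        draws = rng.zipf(alpha, size=size).astype(float)
        out = np.concatenate([out, draws[draws >= xmin]])
    return out[:size]


def bootstrap_p_value(data, xmin, n_boot=100, seed=0):
    """Fraction of synthetic tails that fit worse than the observed tail."""
    rng = np.random.default_rng(seed)
    tail = np.asarray(data, dtype=float)
    tail = tail[tail >= xmin]
    alpha = fit_alpha(tail, xmin)
    observed = ks_distance(tail, alpha, xmin)
    exceed = 0
    for _ in range(n_boot):
        synthetic = sample_tail(alpha, xmin, tail.size, rng)
        exceed += ks_distance(synthetic, fit_alpha(synthetic, xmin), xmin) >= observed
    return alpha, observed, exceed / n_boot


# Synthetic demo data drawn from a known power law, not the Moby Dick counts.
demo = sample_tail(2.5, 1, 1000, np.random.default_rng(1))
alpha_hat, distance, p_value = bootstrap_p_value(demo, xmin=1, n_boot=20)
print(alpha_hat, distance, p_value)
```

In this scheme a large p-value means the power law is not rejected, which is the opposite reading from the p-value that scipy.stats.kstest prints. The kstest p-value is also not designed for parameters estimated from the same data, which is one reason a bootstrap is used instead.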
It is still not clear to me how to determine significance and goodness of fit by using scipy.stats.kstest with the powerlaw library.
However, powerlaw implements its own distribution_compare method, which returns the likelihood ratio R and the p-value of R (see some content from Aaron Clauset here):
R : float
    Loglikelihood ratio of the two distributions' fit to the data. If greater than 0, the first distribution is preferred. If less than 0, the second distribution is preferred.
p : float
    Significance of R
from urllib.request import urlretrieve  # Python 3 location of urlretrieve
from numpy import genfromtxt
import powerlaw
urlretrieve('https://raw.github.com/jeffalstott/powerlaw/master/manuscript/words.txt', 'words.txt')
words = genfromtxt('words.txt')
fit = powerlaw.Fit(words, discrete=True)
print(fit.distribution_compare('power_law', 'exponential', normalized_ratio=True))
(9.135914718776998, 6.485614241379581e-20)
print(fit.distribution_compare('power_law', 'truncated_power_law'))
(-0.917123083373983, 0.1756268316869548)
print(fit.distribution_compare('power_law', 'lognormal'))
(0.008785246720842022, 0.9492243713193919)
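The (R, p) pairs printed above can be read with a small helper like the one below. This is only a sketch: interpret is a hypothetical function, and the 0.1 cutoff is an assumed significance threshold that you may want to change.

```python
def interpret(first, second, R, p, threshold=0.1):
    """Read an (R, p) pair; threshold is an assumed significance cutoff."""
    if p >= threshold:
        return "no significant preference between {} and {}".format(first, second)
    return "{} preferred".format(first if R > 0 else second)


# (R, p) values copied from the output above.
comparisons = [
    ("power_law", "exponential", 9.135914718776998, 6.485614241379581e-20),
    ("power_law", "truncated_power_law", -0.917123083373983, 0.1756268316869548),
    ("power_law", "lognormal", 0.008785246720842022, 0.9492243713193919),
]
verdicts = [interpret(*comparison) for comparison in comparisons]
for verdict in verdicts:
    print(verdict)
```

Read this way, the power law is clearly favored over the exponential, while the comparisons with the truncated power law and the lognormal are inconclusive at this threshold.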