How to properly fit data to a power law in Python?

Question

I am counting the occurrences of unique words in the novel Moby Dick and using the powerlaw Python package to fit the word frequencies to a power law.

I am not sure why I can't reproduce the results from previous work by Clauset et al., as both the p-value and the KS score are "bad".

The idea is to fit the frequencies of unique words to a power law. However, the Kolmogorov-Smirnov goodness-of-fit tests calculated by scipy.stats.kstest look terrible.

I have the following function to fit data to a power law:

import numpy as np
import powerlaw
import scipy
from scipy import stats

def fit_x(x):
    fit = powerlaw.Fit(x, discrete=True)
    alpha = fit.power_law.alpha
    xmin  = fit.power_law.xmin
    print('powerlaw', scipy.stats.kstest(x, "powerlaw", args=(alpha, xmin), N=len(x)))
    print('lognorm', scipy.stats.kstest(x, "lognorm", args=(np.mean(x), np.std(x)), N=len(x)))

Downloading the frequencies of unique words in the novel Moby Dick by Herman Melville (supposed to follow a power law according to Aaron Clauset et al.):

wget http://tuvalu.santafe.edu/~aaronc/powerlaws/data/words.txt

Python script:

x =  np.loadtxt('./words.txt')
fit_x(x)

Results:

('powerlaw', KstestResult(statistic=0.862264651286131, pvalue=0.0))
('lognorm', KstestResult(statistic=0.9910368602492707, pvalue=0.0))

For comparison, when I follow this R tutorial on the same Moby Dick dataset, I get a decent p-value and KS test value, as expected:

library("poweRlaw")
data("moby", package="poweRlaw")
m_pl = displ$new(moby)
est = estimate_xmin(m_pl)
m_pl$setXmin(est)
bs_p = bootstrap_p(m_pl)
bs_p$p
## [1] 0.6738

What am I missing when computing the KS test values and post-processing the fit from the powerlaw Python library? The PDF and CDF look OK to me, but the KS tests look awry.

Answer 1

I think you should pay attention to whether the data is continuous or discrete, and then choose the appropriate test method. In addition, as mentioned previously, the size of the data will have some impact on the result. I hope this helps.
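To make the discrete-versus-continuous point concrete, here is a minimal sketch of a KS distance computed against a discrete power law on the tail x >= xmin. It is not the powerlaw package's implementation: the helper names fit_discrete_alpha and ks_distance_discrete are hypothetical, xmin is assumed to be known, and a synthetic Zipf sample stands in for the word counts so the snippet runs without downloading anything.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta  # zeta(a, q) is the Hurwitz zeta function


def fit_discrete_alpha(tail, xmin):
    """Maximum-likelihood exponent of a discrete power law on x >= xmin."""
    n = tail.size
    sum_log = np.log(tail).sum()

    def neg_log_likelihood(alpha):
        # p(x) = x**(-alpha) / zeta(alpha, xmin)
        return alpha * sum_log + n * np.log(zeta(alpha, xmin))

    result = minimize_scalar(neg_log_likelihood, bounds=(1.01, 6.0), method="bounded")
    return result.x


def ks_distance_discrete(data, xmin):
    """Fit alpha on the tail, then return (alpha, KS distance) for that tail."""
    tail = np.asarray(data, dtype=float)
    tail = tail[tail >= xmin]
    alpha = fit_discrete_alpha(tail, xmin)

    values, counts = np.unique(tail, return_counts=True)
    empirical_cdf = np.cumsum(counts) / tail.size
    # Discrete model CDF: P(X <= v) = 1 - zeta(alpha, v + 1) / zeta(alpha, xmin)
    model_cdf = 1.0 - zeta(alpha, values + 1.0) / zeta(alpha, xmin)
    return alpha, float(np.max(np.abs(empirical_cdf - model_cdf)))


# Synthetic stand-in for word counts (assumption): Zipf sample with exponent 2.0
rng = np.random.default_rng(0)
sample = rng.zipf(2.0, size=5000)
alpha_hat, ks_d = ks_distance_discrete(sample, xmin=1)
print(alpha_hat, ks_d)
```

The key design choice is that both the empirical and the model CDF are evaluated only on the tail and only at integer values, which is what distinguishes this from a continuous KS test on the full data set.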

Answer 2

It is still not clear to me how to determine significance and goodness of fit by using scipy.stats.kstest with the powerlaw library.
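One likely contributor to the poor scipy.stats.kstest numbers is that the distribution scipy calls "powerlaw" is not the heavy-tailed law fitted by the powerlaw package. In scipy it has density a * x**(a - 1) on the interval [0, 1], shifted and scaled by loc and scale. Passing args=(alpha, xmin) therefore sets a=alpha and loc=xmin, which confines the model to [xmin, xmin + 1]. The short check below illustrates this; the values of alpha and xmin are made up for the demonstration and are not fit results.

```python
from scipy import stats

alpha, xmin = 1.95, 7.0  # illustrative values only, not a fit result

# scipy's "powerlaw" density is a * x**(a - 1) on [0, 1]
pdf_half = stats.powerlaw.pdf(0.5, 3.0)  # 3 * 0.5**2

# args=(alpha, xmin) means a=alpha, loc=xmin: support is [xmin, xmin + 1]
cdf_below = stats.powerlaw.cdf(5.0, alpha, xmin)    # below the support
cdf_above = stats.powerlaw.cdf(100.0, alpha, xmin)  # above the support

print(pdf_half, cdf_below, cdf_above)
```

Under this parametrization every observation larger than xmin + 1 has a model CDF of exactly 1, so a large KS statistic is to be expected regardless of how well the power law actually fits.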

However, powerlaw implements its own distribution_compare method, which returns the likelihood ratio R and the p-value of R (see some related content from Aaron Clauset here):

R : float
    Loglikelihood ratio of the two distributions' fit to the data. If greater than 0, the first distribution is preferred. If less than 0, the second distribution is preferred.

p : float
    Significance of R.

from urllib.request import urlretrieve  # Python 3; urllib.urlretrieve exists only in Python 2

from numpy import genfromtxt
import powerlaw

# Download the Moby Dick word counts shipped with the powerlaw manuscript
urlretrieve('https://raw.github.com/jeffalstott/powerlaw/master/manuscript/words.txt', 'words.txt')
words = genfromtxt('words.txt')

fit = powerlaw.Fit(words, discrete=True)

# Each comparison returns the tuple (R, p)
print(fit.distribution_compare('power_law', 'exponential', normalized_ratio=True))
# (9.135914718776998, 6.485614241379581e-20)
print(fit.distribution_compare('power_law', 'truncated_power_law'))
# (-0.917123083373983, 0.1756268316869548)
print(fit.distribution_compare('power_law', 'lognormal'))
# (0.008785246720842022, 0.9492243713193919)
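As a rough guide to reading these (R, p) pairs, the sketch below turns each one into a plain-language verdict. The helper interpret_comparison is hypothetical and not part of powerlaw, and the 0.1 significance threshold is an assumption chosen for illustration; pick whatever level suits your analysis.

```python
def interpret_comparison(R, p, first, second, threshold=0.1):
    """Summarize an (R, p) pair from distribution_compare.

    `threshold` is an assumed significance level, not a powerlaw default.
    """
    if p >= threshold:
        # The sign of R cannot be trusted when R is not significant
        return "inconclusive between {} and {}".format(first, second)
    return "{} preferred".format(first if R > 0 else second)


# (R, p) values copied from the output shown above
results = [
    ((9.135914718776998, 6.485614241379581e-20), "power_law", "exponential"),
    ((-0.917123083373983, 0.1756268316869548), "power_law", "truncated_power_law"),
    ((0.008785246720842022, 0.9492243713193919), "power_law", "lognormal"),
]

verdicts = [interpret_comparison(R, p, a, b) for (R, p), a, b in results]
for verdict in verdicts:
    print(verdict)
```

Read this way, the power law is clearly favored over the exponential, while the data do not distinguish it from a truncated power law or a lognormal.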

Answer 1
1 2020-09-09 09:24:51

Answer 2
0 2021-09-26 07:50:53
