
Fit distribution to given frequency values in R

I have frequency values changing with time (x-axis units), as presented in the picture below. After some normalization, these values may be seen as data points of a density function for some distribution.

Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
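The normalization mentioned above can be sketched as dividing by the total count (the bin width on the x-axis is 1, so no further scaling is needed):

```r
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

# Normalize the frequencies so they sum to 1; since the bin width
# is 1, this is directly an empirical density estimate.
dens <- sample / sum(sample)
sum(dens)  # 1
```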

(plot of the frequency values)

Update. To avoid being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean I have data which says that I have:

  • 7787 realizations of value 1
  • 3056 realizations of value 2
  • 2359 realizations of value 3
  • ... etc.

One way towards my goal (an incorrect one, I think) would be to create the set of these realizations:

# Loop to expand frequencies into individual realizations
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}

hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)

(histogram of set.values with the frequency curve overlaid)

and use fitdistr on set.values:

library(MASS)  # for fitdistr
f2 <- fitdistr(set.values, 'weibull')
f2
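For what it's worth, the expansion loop can be written as a single vectorized rep() call, which avoids growing the vector inside the loop:

```r
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

# rep() repeats each index i exactly sample[i] times,
# producing one entry per realization.
set.values <- rep(seq_along(sample), times = sample)
```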

Why do I think this is an incorrect approach, and why am I looking for a better solution in R?

  • the distribution-fitting approach presented above assumes that set.values is a complete set of my realizations from the distribution T

  • in my original question I only know the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function)
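One principled way to handle the missing tail (a sketch of my own, not from the answers below) is to fit the Weibull by maximum likelihood truncated to the observed window, so the fit does not assume the observations are complete:

```r
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
x <- seq_along(sample)  # observed values 1..22

# Negative log-likelihood of a Weibull truncated to the observed
# window: each density value is renormalized by the probability mass
# inside (0, max(x)], so the unseen tail does not bias the estimates.
# Parameters are optimized on the log scale to keep them positive.
nll <- function(logp) {
  shape <- exp(logp[1]); scale <- exp(logp[2])
  d <- dweibull(x, shape, scale) / pweibull(max(x), shape, scale)
  -sum(sample * log(d))
}

fit <- optim(c(0, log(5)), nll)  # default Nelder-Mead
exp(fit$par)                     # estimated shape and scale
```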

First attempt, using all the points

Second attempt, with the first points dropped. Here is a better attempt; like before, it uses optim to find the best value constrained to a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.

Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that Weibull is maybe not a good fit for all of the data. The problem is the first two points - they are just too large. You can see the attempted fit to all data in the first plot.

If I drop those first two points and just fit the rest, we get a much better fit. You can see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.

# only base R's optim() is needed here
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22

s.fit <- sample[3:23]
t.fit <- t.sample[3:23]

wx <- function(param) { 
  res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
  return(res)
} 
minwx <- function(param){
  v <- s.fit-wx(param)
  sqrt(sum(v*v))
}

p0 <- c(1,200,1/20)
paramopt <- optim(p0, minwx, gr = NULL, method = "L-BFGS-B",
                  lower = c(0.1, 100, 0.01), upper = c(1.1, 5000, 1))

popt <- paramopt$par
popt
rms <- paramopt$value
# note: param[2] scales the density (y), param[3] scales t (x)
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f  xscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)

plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)

You can directly calculate the maximum-likelihood parameters, as described here.

# Defining the error of the implicit function
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}

# Driving the error to (nearly) zero, fulfilling the implicit equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 1e-7)$minimum

# Calculate lambda (the Weibull scale), given k:
# the MLE is (weighted mean of x^k)^(1/k)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)

# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))
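As a sanity check (my addition), the closed-form estimates can be compared against MASS::fitdistr applied to the expanded realizations; the two approaches should give similar shape and scale values:

```r
library(MASS)

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

# Expand the frequency table into individual observations,
# then fit a Weibull by maximum likelihood.
obs <- rep(seq_along(sample), times = sample)
f <- fitdistr(obs, 'weibull')
f$estimate  # shape and scale, to compare with k and l above
```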

Assuming the data are from a Weibull distribution, you can estimate the shape and scale parameters like this:

library(MASS)  # for fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f

If you are not sure whether it is Weibull-distributed, I would recommend using ks.test. This tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best.

For your example this would look like this:

ks <- ks.test(sample, "pweibull", shape = f$estimate[1], scale = f$estimate[2])
ks

The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution. (Note that the p-value from ks.test is only approximate here, since the parameters were estimated from the same data.)

Update: The histograms of both the Weibull and the exponential fit look like a good match to your data. I think the exponential distribution gives you a better fit. The Pareto distribution is another option.

library(MASS)  # for fitdistr

f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape = f$estimate[1], scale = f$estimate[2])
hist(z)

f <- fitdistr(sample, 'exponential')
z <- rexp(10000, f$estimate[1])
hist(z)
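To compare the candidate fits more quantitatively than by eye, you can use AIC; this works because MASS provides a logLik method for fitdistr objects, and a lower AIC indicates a better fit:

```r
library(MASS)

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

# Fit both candidate distributions and compare their AIC values;
# AIC penalizes the Weibull's extra parameter relative to the exponential.
fw <- fitdistr(sample, 'weibull')
fe <- fitdistr(sample, 'exponential')
AIC(fw, fe)  # lower AIC is the better-fitting model
```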
