简体   繁体   中英

R: adjusting a given time-series but keeping summary statistics equal

Let's say I have a time-series like this

t       x
1       100
2       50
3       200
4       210
5       90
6       80
7       300

Is it possible in R to generate a new dataset x1 which has the exact same summary statistics, eg mean, variance, kurtosis, skew as x ?

The reason for my asking is that I would like to do an experiment where I want to test how subjects react to different graphs of data that contain the same information.

I recently read:

  • Matejka, Justin, and George Fitzmaurice. "Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.

  • Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset, The American Statistician, 2007,

However, Matejka uses code in Python that is quite scientific and their data is more complex than time-series data, so I was wondering if there was a way to do this more efficiently for a simpler data set?

Best regards

I'm not aware of a package that can give you precisely what you are looking for. One option is using the datasets in the datasauRus package as JasonAizkalns pointed out. However, if you want to create your own dataset, you can try this: Fit the Johnson distribution from the SuppDists package to get the moments of the dataset and draw new sets from that distribution until the difference is sufficiently small. Below an example with your dataset, although more observations make it easier to replicate the summary statistics:

library(SuppDists)
a <- c(100,50,200,210,90,80,300)

momentsDiffer <- function(x1,x2){
  diff <- sum(abs(moments(x1)- moments(x2)))
  return(diff)
}

repDataset <- function(x,n){
  # fit Johnson distribution
  parms<-JohnsonFit(a, moment="quant")
  # generate from distribution n times storing if improved
  current <- rJohnson(length(a),parms)
  momDiff <- momentsDiffer(x,current)
  for(i in 1:n){
    temp <- rJohnson(length(a),parms)
    tempDiff <- momentsDiffer(x,temp)
    if(tempDiff < momDiff){
      current <- temp
      momDiff <- tempDiff
    }
  }
  return(current)
}

# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
        mean        sigma         skew         kurt 
148.14048691  84.24884165   1.04201116  -0.05008629 

> moments(a)
       mean       sigma        skew        kurt 
147.1428571  84.1281821   0.5894543  -1.0198303 

EDIT - Added additional method Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence leaving out 4 observations. Those 4 observations are then added through solving a non-linear equation on the difference between the four moments of the original sequence and the new sequence. This still not generate a perfect match, feel free to improve.

library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100,50,200,210,90,80,300)
#a <- floor(runif(1000,0,101))

init <- floor(runif(length(a)-4,min(a),max(a)+1))
moments <- moments(a)

f <- function(x) {
  a <- mean(c(init,x))
  b <- var(c(init,x))
  c <- skewness(c(init,x))
  d <- kurtosis(c(init,x))
  c(a-moments[1],b-moments[2],c-moments[3],d-moments[4])
}
result <- nleqslv(runif(4,min(a),max(a)+1), f,control=list(ftol=.00000001, allowSingular=TRUE))

> moments(c(init,result$x))
       mean       sigma        skew        kurt 
49.12747961 29.85435993  0.03327868 -1.25408078 

> moments(a)
       mean       sigma        skew        kurt 
49.96600000 29.10805462  0.03904256 -1.18250616

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM