Let's say I have a time-series like this
t x
1 100
2 50
3 200
4 210
5 90
6 80
7 300
Is it possible in R to generate a new dataset x1
which has the exact same summary statistics, eg mean, variance, kurtosis, skew as x
?
The reason for my asking is that I would like to do an experiment where I want to test how subjects react to different graphs of data that contain the same information.
I recently read:
Matejka, Justin, and George Fitzmaurice. "Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.
Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset, The American Statistician, 2007,
However, Matejka uses code in Python that is quite scientific and their data is more complex than time-series data, so I was wondering if there was a way to do this more efficiently for a simpler data set?
Best regards
I'm not aware of a package that can give you precisely what you are looking for. One option is using the datasets in the datasauRus
package as JasonAizkalns pointed out. However, if you want to create your own dataset, you can try this: Fit the Johnson distribution
from the SuppDists
package to get the moments of the dataset and draw new sets from that distribution until the difference is sufficiently small. Below an example with your dataset, although more observations make it easier to replicate the summary statistics:
library(SuppDists)
a <- c(100,50,200,210,90,80,300)
momentsDiffer <- function(x1,x2){
diff <- sum(abs(moments(x1)- moments(x2)))
return(diff)
}
repDataset <- function(x,n){
# fit Johnson distribution
parms<-JohnsonFit(a, moment="quant")
# generate from distribution n times storing if improved
current <- rJohnson(length(a),parms)
momDiff <- momentsDiffer(x,current)
for(i in 1:n){
temp <- rJohnson(length(a),parms)
tempDiff <- momentsDiffer(x,temp)
if(tempDiff < momDiff){
current <- temp
momDiff <- tempDiff
}
}
return(current)
}
# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
mean sigma skew kurt
148.14048691 84.24884165 1.04201116 -0.05008629
> moments(a)
mean sigma skew kurt
147.1428571 84.1281821 0.5894543 -1.0198303
EDIT - Added additional method Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence leaving out 4 observations. Those 4 observations are then added through solving a non-linear equation on the difference between the four moments of the original sequence and the new sequence. This still not generate a perfect match, feel free to improve.
library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100,50,200,210,90,80,300)
#a <- floor(runif(1000,0,101))
init <- floor(runif(length(a)-4,min(a),max(a)+1))
moments <- moments(a)
f <- function(x) {
a <- mean(c(init,x))
b <- var(c(init,x))
c <- skewness(c(init,x))
d <- kurtosis(c(init,x))
c(a-moments[1],b-moments[2],c-moments[3],d-moments[4])
}
result <- nleqslv(runif(4,min(a),max(a)+1), f,control=list(ftol=.00000001, allowSingular=TRUE))
> moments(c(init,result$x))
mean sigma skew kurt
49.12747961 29.85435993 0.03327868 -1.25408078
> moments(a)
mean sigma skew kurt
49.96600000 29.10805462 0.03904256 -1.18250616
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.