
Building an empirical cumulative distribution function and data interpolation in R

Here's an example data frame I'm working with:

 level    Income    cumpop
 1      17995.50  0.028405
 2      20994.75  0.065550
 3      29992.50  0.876185
 4      41989.50  2.364170
 5      53986.50  4.267305
 6      65983.50  6.323390
 7      77980.51  8.357625
 8      89977.50 10.238910
 9     101974.50 11.923545
10     113971.51 13.389680
11     125968.49 14.659165
12     137965.50 15.753850
13     149962.52 16.673735
14     161959.50 17.438485
15     173956.50 18.093985
16     185953.52 18.640235
17     197950.52 19.099085
18     209947.52 19.514235
19     221944.50 19.863835
20     233941.50 20.169735
21     251936.98 20.628585
22     275931.00 20.936670
23     383904.00 21.850000

The entire population of this particular country has been sorted by income and grouped into 23 corresponding 'levels'. The Income variable is the average income of all members of that level (this is importantly different from saying, for example, that the 10th percentile income is 17995.50).

But the population size of each level is inconsistent (you'll notice this if you look at the differences in cumpop, i.e. cumulative population). Ultimately, I want to build a 10-row data frame that gives interpolated decile values for the variable Income, so that, for example, we'd be able to say "the poorest 10% of the population on average make 28,000" or "those in the 20th to 30th percentile of the population on average make 41,000" and so on. So effectively I want to reduce these 23 levels into 10 levels of equal population size (taking cumpop[23] as the total population), which requires some interpolation.

I've looked around for a library that does this sort of empirical cumulative distribution function generation/interpolation and it seems ecdf is quite useful, but I'm not sure how to apply it to Income subject to cumpop as described above.

Would greatly appreciate some direction here.

A quick and dirty solution using loess interpolation. The span is set very short to ensure an essentially perfect fit; unfortunately, this also makes any error terms meaningless. It could be worth trying a proper regression instead.

incdist <- read.table("inc.txt", header=TRUE)

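# Fit income against cumulative population; the short span makes the curve follow the points closely
fit <- loess(incdist$Income~incdist$cumpop, span=0.2)
# Evaluation points at 0%, 10%, ..., 90% of the total population
V1 <- seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10)
V2 <- predict(fit, V1)
pred <- data.frame(V1, V2)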

par(mar=c(5, 5.5, 4, 2) + 0.1)

plot(incdist$Income~incdist$cumpop, type="n", xaxt="n", yaxt="n",
    xlab="percentile", ylab=expression(frac("average income",1000)),
    main="income distribution")

abline(h=V2, v=V1[-1], col="grey")
points(incdist$Income~incdist$cumpop, col="grey")
# Draw the fitted values; passing the loess object itself to lines() would trace its raw x/y data
lines(incdist$cumpop, fitted(fit), col="red")
points(pred, col="blue", cex=1.5, pch=9)
axis(side=1, at=V1[-1], labels=c(1:9)*10)
axis(side=2, at=V2, labels=round(V2/1000), las=1)
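
As a cross-check on the loess values, the same grid can be evaluated with plain piecewise-linear interpolation. This is only a sketch, not part of the original answer: it retypes the question's table so that it runs on its own, and the names `grid`, `lin` and `check` are made up for illustration. One thing it makes visible is the first grid point: 0 lies just below the smallest cumpop, where `predict()` on a default loess fit returns NA rather than extrapolating, while `approx()` with `rule = 2` holds the end value constant instead.

```r
# Self-contained copy of the example data from the question
incdist <- data.frame(
  level  = 1:23,
  Income = c(17995.50, 20994.75, 29992.50, 41989.50, 53986.50, 65983.50,
             77980.51, 89977.50, 101974.50, 113971.51, 125968.49, 137965.50,
             149962.52, 161959.50, 173956.50, 185953.52, 197950.52, 209947.52,
             221944.50, 233941.50, 251936.98, 275931.00, 383904.00),
  cumpop = c(0.028405, 0.065550, 0.876185, 2.364170, 4.267305, 6.323390,
             8.357625, 10.238910, 11.923545, 13.389680, 14.659165, 15.753850,
             16.673735, 17.438485, 18.093985, 18.640235, 19.099085, 19.514235,
             19.863835, 20.169735, 20.628585, 20.936670, 21.850000)
)

total <- max(incdist$cumpop)
# Same evaluation points as V1: 0%, 10%, ..., 90% of the total population
grid <- seq(0, total * 9/10, by = total / 10)

# Piecewise-linear interpolation of Income over cumpop;
# rule = 2 returns the nearest end value outside the observed range
lin <- approx(incdist$cumpop, incdist$Income, xout = grid, rule = 2)

check <- data.frame(percentile = 0:9 * 10, cumpop = lin$x, Income = lin$y)
print(check)
```

If the loess and linear values disagree noticeably at some grid point, that is a hint the smoother is doing more than interpolating there.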

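[Figure: "income distribution" plot of average income (in thousands) against population percentile, showing the loess curve and the predicted decile points]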
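
Note that the curve above gives the income level at each decile boundary, whereas the question asks for the average income within each decile. A possible sketch for the latter, under the loudly stated assumption that everyone inside a level earns that level's average income: cumulative total income is then piecewise linear in cumulative population, so it can be interpolated at the decile breaks and differenced. The names `pop`, `cuminc`, `breaks` and `deciles` are illustrative and do not come from the original post.

```r
# Self-contained copy of the example data from the question
incdist <- data.frame(
  level  = 1:23,
  Income = c(17995.50, 20994.75, 29992.50, 41989.50, 53986.50, 65983.50,
             77980.51, 89977.50, 101974.50, 113971.51, 125968.49, 137965.50,
             149962.52, 161959.50, 173956.50, 185953.52, 197950.52, 209947.52,
             221944.50, 233941.50, 251936.98, 275931.00, 383904.00),
  cumpop = c(0.028405, 0.065550, 0.876185, 2.364170, 4.267305, 6.323390,
             8.357625, 10.238910, 11.923545, 13.389680, 14.659165, 15.753850,
             16.673735, 17.438485, 18.093985, 18.640235, 19.099085, 19.514235,
             19.863835, 20.169735, 20.628585, 20.936670, 21.850000)
)

# Population of each level, recovered from the cumulative column
pop <- diff(c(0, incdist$cumpop))

# Cumulative total income (assumption: every member of a level earns the level average)
cuminc <- cumsum(pop * incdist$Income)

total  <- max(incdist$cumpop)
breaks <- seq(0, total, length.out = 11)   # decile boundaries in population units

# Interpolate cumulative income at the decile boundaries;
# rule = 2 guards against rounding pushing the last break past the data range
cum_at_breaks <- approx(c(0, incdist$cumpop), c(0, cuminc),
                        xout = breaks, rule = 2)$y

# Average income per decile = income earned in the decile / people in the decile
deciles <- data.frame(decile     = 1:10,
                      avg_income = diff(cum_at_breaks) / (total / 10))
print(deciles)
```

Because the ten groups have equal population, the plain mean of `avg_income` should reproduce the overall population-weighted mean income, which makes for a quick sanity check. A smoother within-level assumption would change the decile figures somewhat, so these values are only as good as the flat-within-level assumption.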

