Here's an example data frame I'm working with:
level Income cumpop
1 17995.50 0.028405
2 20994.75 0.065550
3 29992.50 0.876185
4 41989.50 2.364170
5 53986.50 4.267305
6 65983.50 6.323390
7 77980.51 8.357625
8 89977.50 10.238910
9 101974.50 11.923545
10 113971.51 13.389680
11 125968.49 14.659165
12 137965.50 15.753850
13 149962.52 16.673735
14 161959.50 17.438485
15 173956.50 18.093985
16 185953.52 18.640235
17 197950.52 19.099085
18 209947.52 19.514235
19 221944.50 19.863835
20 233941.50 20.169735
21 251936.98 20.628585
22 275931.00 20.936670
23 383904.00 21.850000
The entire population of this particular country has been sorted by income and grouped into 23 corresponding 'levels'. The Income variable is the average income of all members of that level (this is importantly different from saying, for example, that the 10th percentile income is 17995.50). But the population size of each level is inconsistent (you'll notice this if you look at the differences in cumpop, i.e. cumulative population).
Ultimately, I want to build a 10-row data frame that gives interpolated decile values for the variable Income, so that, for example, we'd be able to say "the poorest 10% of the population on average make 28,000" or "those in the 20th to 30th percentile of the population on average make 41,000", and so on. So effectively I want to reduce these 23 levels into 10 levels of equal population size (taking cumpop[23] as the total population), which requires some interpolation.
I've looked around for a library that does this sort of empirical cumulative distribution function generation/interpolation, and it seems ecdf is quite useful, but I'm not sure how to apply it to Income subject to cumpop as described above.
Would greatly appreciate some direction here.
A quick and dirty solution using loess interpolation. The span is set very short to ensure an essentially perfect fit; unfortunately, this also makes any error terms meaningless. It could be worth trying a proper regression.
incdist <- read.table("inc.txt", header=TRUE)
# Fit a very local loess curve of income against cumulative population
fit <- loess(incdist$Income~incdist$cumpop, span=0.2)
# Decile boundaries at 0%, 10%, ..., 90% of the total population
V1 <- seq(0, max(incdist$cumpop)*9/10, max(incdist$cumpop)/10)
# Fitted income at each boundary; the value at 0 is NA because
# loess does not extrapolate below the smallest observed cumpop
V2 <- predict(fit, V1)
pred <- data.frame(V1, V2)
# Plot the data, the fitted curve and the predicted points
par(mar=c(5, 5.5, 4, 2) + 0.1)
plot(incdist$Income~incdist$cumpop, type="n", xaxt="n", yaxt="n",
     xlab="percentile", ylab=expression(frac("average income",1000)),
     main="income distribution")
abline(h=V2[-1], v=V1[-1], col="grey")
points(incdist$Income~incdist$cumpop, col="grey")
# Draw the fitted values rather than passing the loess object to lines()
lines(incdist$cumpop, fitted(fit), col="red")
points(pred, col="blue", cex=1.5, pch=9)
axis(side=1, at=V1[-1], labels=c(1:9)*10)
axis(side=2, at=V2[-1], labels=round(V2[-1]/1000), las=1)
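Note that the loess fit above gives the fitted income at each decile boundary rather than the average income inside each decile. If the within-decile average is what is wanted, one possible alternative is to interpolate cumulative total income instead of income itself. The following is only a minimal sketch, and it rests on a strong assumption that is not in the question: everyone inside a level earns exactly that level's mean income, so cumulative income is piecewise linear in cumulative population. The names pop, cuminc, breaks, inc_at_breaks and deciles are made up for illustration.

```r
# Data from the question: mean income per level and cumulative population.
incdist <- data.frame(
  Income = c(17995.50, 20994.75, 29992.50, 41989.50, 53986.50, 65983.50,
             77980.51, 89977.50, 101974.50, 113971.51, 125968.49, 137965.50,
             149962.52, 161959.50, 173956.50, 185953.52, 197950.52, 209947.52,
             221944.50, 233941.50, 251936.98, 275931.00, 383904.00),
  cumpop = c(0.028405, 0.065550, 0.876185, 2.364170, 4.267305, 6.323390,
             8.357625, 10.238910, 11.923545, 13.389680, 14.659165, 15.753850,
             16.673735, 17.438485, 18.093985, 18.640235, 19.099085, 19.514235,
             19.863835, 20.169735, 20.628585, 20.936670, 21.850000)
)

# Population of each level, then cumulative total income up to each level.
pop    <- diff(c(0, incdist$cumpop))
cuminc <- cumsum(incdist$Income * pop)

# Decile boundaries in population units (0%, 10%, ..., 100% of the total).
total  <- max(incdist$cumpop)
breaks <- seq(0, total, length.out = 11)

# Linearly interpolate cumulative income at each boundary.
# Assumption: income is uniform within a level (see the text above).
inc_at_breaks <- approx(c(0, incdist$cumpop), c(0, cuminc), xout = breaks)$y

# Mean income per decile = income earned within the decile / decile population.
deciles <- data.frame(
  decile      = 1:10,
  mean_income = diff(inc_at_breaks) / (total / 10)
)
print(deciles)
```

Because every decile has the same population, the plain mean of the ten mean_income values equals the overall mean income, which makes a handy sanity check. A smoother interpolant (for example a monotone spline on the cumulative income curve) could replace approx if the within-level uniformity assumption is too crude.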