
Accumulating plot in ggplot2

Is it possible to create a graph from 0% to 100% on the x axis and units on the y axis, accumulating from y=0 to y=max, so that I can say "X% of my elements occurred within the first Y units"? Is there a predefined stat in ggplot2 that lets me do that?

Here's some data: http://sprunge.us/XYJK

You can compute the cumulative share either before passing the data to ggplot or inside the plot call itself.

For example:

library(ggplot2)
library(scales)  # for percent axis labels
x <- eval(parse(file("http://sprunge.us/XYJK"))) # Your data
d <- data.frame(x = x, y = seq_along(x))  # y counts the units
d$z <- cumsum(d$x) / sum(d$x) # Cumulative share of the total

ggplot(d, aes(z, y)) + geom_line() + scale_x_continuous(labels = percent)

OR

library(ggplot2)
library(scales)
d <- data.frame(x = x, y = seq_along(x))
# Same plot, with the cumulative share computed inside aes()
ggplot(d, aes(cumsum(x) / sum(x), y)) + geom_line() + scale_x_continuous(labels = percent)

I'm assuming this is sales data or something like that. In that context, the plot lets you read off statements such as "50% of revenue comes from the first 5000 transactions".
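To get such a figure from the data rather than by eye, something like the following might work. It is only a sketch: the simulated `rlnorm` values stand in for the real data, and the 50% threshold is chosen purely for illustration.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 3, sdlog = 1.5)  # simulated stand-in for the real data
d <- data.frame(x = x, y = seq_along(x))
d$z <- cumsum(d$x) / sum(d$x)                # cumulative share of the total

# Smallest number of units at which the cumulative share reaches 50%
first_half <- d$y[which(d$z >= 0.5)[1]]
first_half
```

Changing `0.5` to another share gives the corresponding number of units.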

It sounds to me as though you're looking for an empirical CDF. Your data have replicated values in a number of places, so I created the empirical CDF based on a frequency table of the sorted values. I copied your data into a vector x and then did the following:

# Naming the table argument makes the value column come out as Var1
tf <- as.data.frame(table(Var1 = x), stringsAsFactors = FALSE)
tf <- within(tf, {
    Var1 <- as.numeric(Var1)
    pct <- 100 * cumsum(Freq) / sum(Freq)
})
ggplot(tf, aes(x = Var1, y = pct)) +
    geom_step(size = 1) +  # use linewidth = 1 in ggplot2 >= 3.4
    labs(x = "Value", y = "Cumulative percentage")
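As for a predefined stat: ggplot2 includes `stat_ecdf()`, which computes the empirical CDF from the raw values, so the frequency table can be skipped. A minimal sketch, using simulated data as a stand-in for `x` (an assumption, since the original data are not reproduced here):

```r
library(ggplot2)
library(scales)

set.seed(1)
x <- rlnorm(1000, meanlog = 3, sdlog = 1.5)  # simulated stand-in for the real data

p <- ggplot(data.frame(x = x), aes(x = x)) +
    stat_ecdf(geom = "step") +               # empirical CDF as a step function
    scale_y_continuous(labels = percent) +
    labs(x = "Value", y = "Cumulative percentage")
p
```

Adding `coord_flip()` would put the percentage on the x axis, as the question asks.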

The problem is that your data are so heavily right-skewed that the histogram resembles a hyperbolic curve: the vast majority of the values are well under 1000, with several serious outliers. To give you an idea,

quantile(x, c(0.005, 0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.995))
       0.5%          1%          5%         10%         25%         50% 
    1.64425     2.79850     7.54500    11.77500    21.76000    39.35000 
        75%         90%         95%         99%       99.5% 
   73.28000   398.05000  1695.78750 10499.99000 11638.55600

and

summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00    21.76    39.35   434.90    73.28 18520.00

The mean is larger than the 90th percentile of the distribution! In that context, I don't think an ecdf plot is going to be very informative. To find out what proportion of values in your vector is less than or equal to a given value, try the following small function:

cumprop <- function(x, val) mean(x <= val)
cumprop(x, 1000)
cumprop(x, mean(x))  # proportion of values <= mean(x)
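Base R's `ecdf()` gives the same proportions and accepts several cut-offs at once. A small sketch on simulated data (the values and cut-offs below are stand-ins, not the data from the question):

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 3, sdlog = 1.5)  # simulated stand-in for the real data

cumprop <- function(x, val) mean(x <= val)
Fn <- ecdf(x)                  # step function: Fn(v) is the proportion of x <= v

cutoffs <- c(10, 100, 1000)    # illustrative cut-offs
by_ecdf <- Fn(cutoffs)
by_mean <- sapply(cutoffs, function(v) cumprop(x, v))
all.equal(by_ecdf, by_mean)    # compare the two approaches
```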
