Is it possible to create a graph from 0% to 100% on the x axis and units on the y axis, accumulating from y=0 to y=max, so that I can say "X% of my elements occurred within the first Y units"? Is there a predefined stat in ggplot2 that allows me to do that?
Here's some data: http://sprunge.us/XYJK
You can compute the cumulative share either before passing the data to ggplot or inside the aes() call.
For example:
library(ggplot2)
library(scales) # for percent labels
x <- eval(parse(file("http://sprunge.us/XYJK"))) # Your data
d <- data.frame(x = x, y = seq_along(x)) # y is the running element count
d$z <- cumsum(d$x) / sum(d$x) # cumulative share of the total
ggplot(d, aes(z, y)) + geom_line() + scale_x_continuous(labels = percent)
Or, doing the computation inside aes():
library(ggplot2)
library(scales)
d <- data.frame(x = x, y = seq_along(x)) # x is the vector loaded above
ggplot(d, aes(cumsum(x) / sum(x), y)) + geom_line() + scale_x_continuous(labels = percent)
I'm assuming this is sales data or something like that. So putting it in that context, 50% of revenue occurs from the first 5000 transactions.
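To read a statement like that off as a number rather than from the plot, one option is to find the first position where the cumulative share crosses a threshold. A minimal sketch, using a made-up vector rather than the data from the question; `units_for_share` is a hypothetical helper name, not part of any package:

```r
# Toy values for illustration only (not the data from the question)
x <- c(5, 1, 3, 8, 2, 1)

# Hypothetical helper: smallest number of leading elements whose
# cumulative sum reaches at least `share` of the total
units_for_share <- function(x, share) {
  cum_share <- cumsum(x) / sum(x)
  which(cum_share >= share)[1]
}

units_for_share(x, 0.5)  # first index at which at least 50% of the total is reached
```

This keeps the elements in their original order, as the plots above do. Sorting `x` first would answer a different question.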
It sounds to me as though you're looking for an empirical CDF. Your data have replicated values in a number of places, so I created the empirical CDF based on a frequency table of the sorted values. I copied your data into a vector x and then did the following:
tf <- as.data.frame(table(x), stringsAsFactors = FALSE)
tf <- within(tf, {
  Var1 <- as.numeric(Var1)
  pct <- 100 * cumsum(Freq) / sum(Freq)
})
ggplot(tf, aes(x = Var1, y = pct)) +
  geom_step(size = 1) +
  labs(x = "Value", y = "Cumulative percentage")
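On the question of a predefined stat: ggplot2 also ships `stat_ecdf()`, which computes the empirical CDF from the raw values, so the frequency table is not strictly needed. A sketch, assuming a numeric vector `x`; the values below are simulated placeholders, not the data from the question:

```r
library(ggplot2)
library(scales)

# Simulated right-skewed placeholder data (an assumption for illustration)
set.seed(1)
x <- rlnorm(200, meanlog = 4, sdlog = 1.5)

# stat_ecdf() computes the cumulative proportion for each value of x
p <- ggplot(data.frame(x = x), aes(x = x)) +
  stat_ecdf(geom = "step") +
  scale_y_continuous(labels = percent) +
  labs(x = "Value", y = "Cumulative percentage")
p
```

Adding `coord_flip()` should put the percentage on the x axis, as the question asks.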
The problem is that your data are so heavily right skewed that the histogram emulates a hyperbolic curve, so the vast majority of the data is well under 1000 with several serious outliers. To give you an idea,
quantile(x, c(0.005, 0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.995))
       0.5%          1%          5%         10%         25%         50%
    1.64425     2.79850     7.54500    11.77500    21.76000    39.35000
        75%         90%         95%         99%       99.5%
   73.28000   398.05000  1695.78750 10499.99000 11638.55600
and
summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.00    21.76    39.35   434.90    73.28 18520.00
The mean is larger than the 90th percentile of the distribution! In that context, I don't think an ecdf plot is going to be very informative. To find out what proportion of values in your vector is less than or equal to a given value, try the following small function:
cumprop <- function(x, val) mean(x <= val)
cumprop(x, 1000)
cumprop(x, mean(x)) # proportion of values <= mean(x)
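If several thresholds are needed at once, base R's `ecdf()` returns a step function that gives the same proportions and accepts a vector of values. A short sketch on simulated placeholder data (not the data from the question); the thresholds are arbitrary:

```r
# Simulated right-skewed placeholder data (an assumption for illustration)
set.seed(1)
x <- rlnorm(200, meanlog = 4, sdlog = 1.5)

cumprop <- function(x, val) mean(x <= val)

Fn <- ecdf(x)                   # empirical CDF as a function
thresholds <- c(50, 100, 1000)  # arbitrary example thresholds

Fn(thresholds)                                 # proportion of values <= each threshold
sapply(thresholds, function(v) cumprop(x, v))  # same values, one threshold at a time
```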