
Cumulative distribution function of count as ratio for subset of data

I am trying to plot CDFs for several data sets on one plot, restricted to a subset of the x range. I restrict the range because the values can be very large and I do not want a wide x-axis. Regardless of the subset range, however, the plotted CDF always spans 0 to 1 on the y-axis. Because data exist outside the subset range, the CDF should never reach 1 within it, only some smaller ratio. How can I compute the CDFs over the whole distribution and then show only a subset of them in the plot?

The code below plots the CDFs, but they do not account for the data beyond the x-axis range: at or around x = 50 the curves reach y = 1, which is impossible given that most values are larger than 50. I have tried a few other stat_ecdf options (commented out with # below) with no success.

library(moments)
library(ggplot2)
library(plyr)
library(dplyr)
library(reshape2)
library(RColorBrewer)
library(cowplot)
library(scales)
library(gridExtra)
require(data.table)
require(grid)

set.seed(8)

dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))

d1_bind = bind_rows(dat1,dat2,dat3,dat4)

md1 <- melt(d1_bind)
colnames(md1) <- c("Dat","Value")
summary(md1)

ggplot(md1, aes(x = Value, color=Dat, linetype=Dat)) +
           stat_ecdf(aes(color = Dat),
#           pad = TRUE, # this does not plot correctly
#           n = 38850, # this or set to NULL does not plot correctly
           geom = "line", size = 1) +
           scale_linetype_manual(values=c("solid", "solid", "solid", "solid")) +
           scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
           scale_x_continuous(limits = c(25, 50)) +
#           scale_x_discrete(breaks = 26:451) + # this does not plot correctly
           scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))

quit()

Using stat_bin and computing the cumulative sum manually produces the same plot as the stat_ecdf version above.

ggplot(md1, aes(x = Value, color=Dat, linetype=Dat)) +
  stat_bin(aes(y = cumsum(..count..)/sum(..count..)),
  geom = "line", size = 1) +
  scale_linetype_manual(values=c("solid", "solid", "solid", "solid")) +
  scale_y_continuous(limits = c(0, 1.0), breaks = seq(0, 1.0, by = 0.05)) +
  scale_x_continuous(limits = c(25, 50)) +
  scale_color_manual(values = c("#000000", "#E69F00", "#56B4E9", "#009E73"))

Your code and the following code give the plot below, which is what I would expect:

library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(8)

dat1 <- data.frame(a = replicate(1,sample(25:300,1000,rep=TRUE)))
dat2 <- data.frame(b = replicate(1,sample(25:350,950,rep=TRUE)))
dat3 <- data.frame(c = replicate(1,sample(25:400,965,rep=TRUE)))
dat4 <- data.frame(d = replicate(1,sample(25:450,970,rep=TRUE)))

df <- bind_rows(dat1, dat2, dat3, dat4, .id = "dat")

df1 <- df %>%
  pivot_longer(cols = a:d, values_drop_na = TRUE)

ggplot(df1, aes(x = value, color = dat, linetype = dat)) + 
  stat_ecdf(aes(color = dat))

[image]
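
The y = 1 at x = 50 seen in the question can be reproduced without ggplot2. Limits set on a scale discard out-of-range values before the statistic is computed, so the ECDF is built from the retained values only. The following is a small sketch using base R's `ecdf()`; the vector `x` mirrors `dat1`, and the names `full_cdf` and `trunc_cdf` are illustrative and do not come from the original post.

```r
set.seed(8)

# Same kind of data as dat1: 1000 draws from 25:300
x <- sample(25:300, 1000, replace = TRUE)

# ECDF computed on all of the data
full_cdf <- ecdf(x)

# ECDF computed only on the values inside the plotted range,
# which mimics what happens when scale limits drop the rest
trunc_cdf <- ecdf(x[x >= 25 & x <= 50])

full_cdf(50)   # proportion of ALL values <= 50, well below 1
trunc_cdf(50)  # 1, because no retained value exceeds 50
```

The first value is the ratio the question is asking for; the second is what the truncated plot shows.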

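If you want to set the limits without recalculating the ECDF (i.e. "zoom" in on the graph), use coord_cartesian rather than scale_x_continuous. Limits set on a scale discard the data outside them before the ECDF is computed, whereas coord_cartesian only changes the visible region: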

ggplot(df1, aes(x = value, color = dat, linetype = dat)) +
  stat_ecdf() + 
  coord_cartesian(xlim = c(25, 50),
                  ylim = c(0, 0.1))

[image]
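
If the ECDF values themselves are needed (for example, to export them), another option is to compute them on the full data first and only then filter to the range of interest. This is a minimal sketch under that assumption, using base R's `ecdf()` together with dplyr. The object `df_cdf` and the column `cdf` are illustrative names, not part of the original post.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(8)

dat1 <- data.frame(a = sample(25:300, 1000, replace = TRUE))
dat2 <- data.frame(b = sample(25:350, 950, replace = TRUE))
dat3 <- data.frame(c = sample(25:400, 965, replace = TRUE))
dat4 <- data.frame(d = sample(25:450, 970, replace = TRUE))

df1 <- bind_rows(dat1, dat2, dat3, dat4, .id = "dat") %>%
  pivot_longer(cols = a:d, values_drop_na = TRUE)

# Compute each group's ECDF on ALL of its values,
# then keep only the range that should be plotted
df_cdf <- df1 %>%
  group_by(dat) %>%
  mutate(cdf = ecdf(value)(value)) %>%
  ungroup() %>%
  filter(value >= 25, value <= 50) %>%
  distinct(dat, value, cdf)

# The y values were computed before filtering, so they stay well below 1
ggplot(df_cdf, aes(x = value, y = cdf, color = dat)) +
  geom_step()
```

Because the ratios are fixed before any filtering, setting axis limits on this plot afterwards no longer changes the y values.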
