
R: Plot ecdf of one column on an axis of another column with ggplot

I'm sure this can be done by collecting all the data separately and then using ggplot only for the plotting, but I'd really prefer a simpler solution within ggplot, particularly stat_ecdf(), because of the easier access to grouping variables, facets, etc.

My dataframe contains, amongst others, two columns of corresponding data x and y. I'd like to plot the ecdf of y on an axis of the corresponding x values. In other words, I'd like to plot what cumulative portion of the y variable is reached at its corresponding x value. While x and y are correlated (both descending), they are not analytically connected, so I cannot simply scale values of y to x. My attempts to do this with separate calculations of the ecdf functions of each subset have gotten extremely messy and complicated, while the stat_ecdf function seems to be very close to getting me what I need.

If I set the x variable in the ggplot aes to x and then set the variable within stat_ecdf to y, I am able to get the ecdf of y with an axis label of x; however, the actual values on the axis correspond to y. This is done with something like:

ggplot(df, aes(x, color=group_var)) + stat_ecdf(aes(y))

EDIT: To visualize this: this sample plot shows the ecdf of x for multiple groups. Each x value has a corresponding y value in a sorted dataframe (approximate relationship; ignore the decreasing regions at the end). I would like to have a similar plot where the horizontal axis is in the corresponding y values. Basically, I need to map the horizontal axis of the first ecdf plot from x->y as simply as possible. I could do this manually by adding ecdf values as a column in the dataframe, but I am looking to do it within ggplot for simplicity, if possible.

Instead of trying to bend stat_ecdf to do something it was not designed for, it's better to be explicit about your intention in the code.

It's quite straightforward. The strangest piece of code, ecdf(y)(y), means 'calculate the empirical CDF for y, and then evaluate it at the actual values of y in my data'. The cummax deals with the decreasing y, to get a non-decreasing eCDF along x.
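To make the two pieces concrete, here is a minimal base-R sketch on a made-up vector (the values of `y` are hypothetical, chosen only so the result is easy to check by hand). `ecdf(y)` returns a function, and calling it on `y` gives, for each element, the fraction of observations less than or equal to it; `cummax()` then turns that sequence into a non-decreasing one.

```r
# Hypothetical values, assumed to be already ordered by their x position.
y <- c(1, 3, 2, 4)

# ecdf(y) builds a step function; calling it on y itself returns,
# for each element, the share of observations <= that element.
fraction <- ecdf(y)(y)
fraction
# 0.25 0.75 0.50 1.00

# cummax() keeps the running maximum, so the dip at the third
# position (where y decreases) is flattened out.
maxf <- cummax(fraction)
maxf
# 0.25 0.75 0.75 1.00
```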

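# Requires library(tidyverse); d_sample is built under "Sample data" below.
d_sample %>%
  group_by(group) %>%
  arrange(group, x) %>%
  mutate(
    # eCDF of y within each group, evaluated at the observed y values
    fraction = ecdf(y)(y),
    # running maximum, so the curve never decreases along x
    maxf = cummax(fraction)) %>%
  ggplot(aes(x, maxf)) +
  geom_point() +
  facet_wrap(~group)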

Figure: ecdf of the sample data
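If pulling in dplyr is not an option, the same two columns can probably be computed in base R with ave(). This is only a sketch under assumptions: the data frame `df` and its values are made up for illustration, and the column names `group`, `x` and `y` mirror the sample data rather than your real data.

```r
# Hypothetical toy data: two groups, y not monotone in x.
df <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x     = c(4, 1, 3, 2, 1, 2, 3, 4),
  y     = c(40, 10, 20, 30, 5, 6, 8, 7)
)

# Sort by group and x so the running maximum follows the x axis.
df <- df[order(df$group, df$x), ]

# Per-group eCDF of y, evaluated at the observed y values.
df$fraction <- ave(df$y, df$group, FUN = function(v) ecdf(v)(v))

# Per-group running maximum of that fraction.
df$maxf <- ave(df$fraction, df$group, FUN = cummax)

df
```

The resulting `maxf` column can then be mapped in ggplot in the usual way, for example with aes(x, maxf).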

I'm still not really sure if that's what you need.

Sample data

To be honest, most of my time went into 'faking' your dataset:

library(tidyverse)

tibble(x = seq_len(300) + 100) %>%
  mutate(
    one = - 1e-3 * (x * x) + 50 + 0.7 * x,
    two = - 1e-3 * (x * x) + 55 + 0.68 * x,
    three = - 1e-3 * (x * x) + 110 + 0.5 * x,
    four = - 1e-3 * (x * x) + 10 + 0.8 * x) %>%
  pivot_longer(-x, names_to = "group", values_to = "y") %>%
  filter(
    group == "one"
    | group == "two"
    | (group == "three" & x < 200)
    | (group == "four" & x > 250)) ->
  d_sample

d_sample %>%
  ggplot(aes(x, y, colour = group)) +
  geom_point()

Figure: scatter plot of the sample data


 