My large data set has a different number of observations per year, e.g., 700 observations for 2016 and 400 for 2017. There are many years of data, so clipping each year manually is not feasible.
I want to cut each year into quantiles, but using only the first 400 observations in each group.
There is a tantalizing "minmax" argument in the Hmisc documentation. Is it possible to use the minmax argument in Hmisc to cut quantiles from observations 1-400 only?
Using dplyr, you can select the first 400 records for each value of year using group_by and slice. Then create quantiles, either within each year or overall.
set.seed(911) # Simulate some uneven data
df <- data.frame(year = rep(2016:2018, times = c(400, 500, 600)),
                 val = rnorm(1500, 50, 5))
library(dplyr); library(tidyr)
This creates quantiles within each year:
df %>% group_by(year) %>%
  slice(1:400) %>%                      # keep the first 400 rows of each year
  mutate(q4 = cut(val,
                  breaks = quantile(val, probs = seq(0, 1, 1/4)),
                  include.lowest = TRUE, labels = FALSE)) %>%
  # You can stop here and save the output; here I continue to check the counts
  count(q4) %>%
  pivot_wider(names_from = q4, values_from = n)
# A tibble: 3 x 5
# Groups: year [3]
# year `1` `2` `3` `4`
# <int> <int> <int> <int> <int>
#1 2016 100 100 100 100
#2 2017 100 100 100 100
#3 2018 100 100 100 100
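If you want to keep the labelled rows rather than just the counts, and you would rather not hard-code 400, here is a minimal sketch. It assumes that "first" means the order in which rows are stored within each year, and that the size of the smallest year is the cutoff you want. The names n_keep and res are only illustrative.

```r
library(dplyr)

set.seed(911)  # same simulated data as in the answer
df <- data.frame(year = rep(2016:2018, times = c(400, 500, 600)),
                 val  = rnorm(1500, 50, 5))

# Assumption: clip every year to the size of the smallest year
n_keep <- min(table(df$year))

res <- df %>%
  group_by(year) %>%
  slice(seq_len(n_keep)) %>%             # first n_keep rows of each year
  mutate(q4 = cut(val,
                  breaks = quantile(val, probs = seq(0, 1, 1/4)),
                  include.lowest = TRUE, labels = FALSE)) %>%
  ungroup()

count(res, year)  # every year should now contribute n_keep rows
```

Here res holds one row per retained observation, with q4 giving its within-year quartile (1 to 4), so it can be joined back or saved as needed.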
Or you can ungroup to create overall quantiles (counts will differ per year):
df %>% group_by(year) %>%
  slice(1:400) %>%                      # keep the first 400 rows of each year
  ungroup() %>%                         # quantiles are now computed over all years
  mutate(q4 = cut(val,
                  breaks = quantile(val, probs = seq(0, 1, 1/4)),
                  include.lowest = TRUE, labels = FALSE)) %>%
  # Stop here to save, or continue to check the counts
  group_by(year) %>%
  count(q4) %>%
  pivot_wider(names_from = q4, values_from = n)
# A tibble: 3 x 5
# Groups: year [3]
# year `1` `2` `3` `4`
# <int> <int> <int> <int> <int>
#1 2016 116 88 102 94
#2 2017 86 114 85 115
#3 2018 98 98 113 91
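On the Hmisc part of the question: as far as I can tell from the cut2 documentation, minmax only extends user-supplied cuts so that they cover the minimum and maximum of x. It does not restrict which observations are used. If you still prefer Hmisc, a minimal sketch is to slice first and then call cut2 with g = 4 inside each year. This assumes Hmisc is installed; res_hmisc is just an illustrative name. cut2 picks its group boundaries somewhat differently from quantile(), so the counts are not guaranteed to match the tables above.

```r
library(dplyr)

set.seed(911)  # same simulated data as in the answer
df <- data.frame(year = rep(2016:2018, times = c(400, 500, 600)),
                 val  = rnorm(1500, 50, 5))

res_hmisc <- df %>%
  group_by(year) %>%
  slice(1:400) %>%                                  # first 400 rows of each year
  # Called as Hmisc::cut2 so that attaching Hmisc does not mask dplyr functions
  mutate(q4 = as.integer(Hmisc::cut2(val, g = 4))) %>%  # quartile group 1 to 4
  ungroup()

count(res_hmisc, year, q4)  # inspect group sizes per year
```

Dropping as.integer() keeps the interval labels that cut2 produces instead of the group numbers.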