Finding the Top X percentile of values and combining all values below that percentile into an Others row per Group In R
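I'm new to coding in R and have run into some trouble. I'm trying to find the values in a column that are above a certain percentile (X%ile) within each group, and then combine all rows below that percentile into a single Others row per group.
My situation is very similar to the question here: How to use fct_lump() to get the top n levels by group and put the rest in 'other'?
I am grouping by two columns and trying to add rows that have the name "Others" in the second column and, in the third column of the same row, the sum of all values that fall below the percentile.
I am working with a large data frame (df) that has the following columns: Year (class = integer, i.e. 2007, 2008, ...), SciName (class = character), and Flowers (class = numeric).
I was able to filter and show only the rows with values above a certain percentile using: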
df_filter <- df %>%
  filter(Flowers > quantile(Flowers, 0.7))
view(df_filter)
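However, I have not been able to find a way to add the Others rows that I need.
Based on the accepted answer to the similar question linked above, I tried: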
df_Others <- df %>%
  ungroup() %>%
  group_by(Year) %>%
  arrange(desc(Flowers)) %>%
  mutate(a = row_number(-Flowers)) %>%
  mutate(SciName = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(SciName))) %>%
  mutate(a = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(a))) %>%
  group_by(Year, SciName, a) %>%
  summarize(Flowers = sum(Flowers)) %>%
  arrange(Year, a) %>%
  select(-a)
View(df_Others)
...but this does not work.
Any suggestions on how to do this would be greatly appreciated!
Edit:
Input:
Year SciName Flowers
2004 Liliac 2000
2004 Rose 3000
2004 Daisy 10
2004 Lily 5
2005 Liliac 20
2005 Rose 3
2005 Daisy 1000
2005 Lily 5000
... ... ...
Expected Output:
Year SciName Flowers
2004 Liliac 2000
2004 Rose 3000
2004 Others 15
2005 Daisy 1000
2005 Lily 5000
2005 Others 23
... ... ...
Without your input data and expected output, it is hard to tell why your current approach is not working. I recommend something like the following:
library(dplyr)
# access iris, a built-in dataset
data(iris)
df = iris %>%
  # all the groups that you want to work within should be listed here
  group_by(Species) %>%
  # this creates a new column containing the threshold for each group
  # replace Sepal.Length with the numeric column you want the percentile of
  mutate(threshold = quantile(Sepal.Length, 0.7)) %>%
  # in case of duplicate values in Sepal.Length
  # creating row-numbers to avoid merging them
  mutate(rn = row_number()) %>%
  # setup variable that contains the desired output groups
  # (rn is coerced to character, so new_group is a character column)
  mutate(new_group = ifelse(Sepal.Length < threshold, "other", rn))
# pause here to inspect and confirm the new_group column gives the preferred groups
df = df %>%
  group_by(Species, new_group) %>%
  # summarise values as desired
  summarise(num = n()) %>%
  # Sepal.Length no longer exists after summarise(), so keep the grouping columns and the count
  select(Species, new_group, num)
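Before collapsing rows, it can help to look at the per-group cutoffs on their own. The following is a minimal sketch, not part of the original answer: the object names `thresholds` and `share_lumped` are made up for illustration, and `names = FALSE` is only used to drop the "70%" label that `quantile()` attaches to its result.

```r
library(dplyr)

# One row per species, each with its own 70th-percentile cutoff
thresholds <- iris %>%
  group_by(Species) %>%
  summarise(threshold = quantile(Sepal.Length, 0.7, names = FALSE))

# Share of rows in each species that fall below that cutoff
# (these are the rows that would be lumped into "other")
share_lumped <- iris %>%
  group_by(Species) %>%
  summarise(share = mean(Sepal.Length < quantile(Sepal.Length, 0.7, names = FALSE)))

print(thresholds)
print(share_lumped)
```

Because `quantile()` is called inside a grouped `summarise()`, each species gets a cutoff computed from its own rows only, which is the behaviour the grouped `mutate()` step relies on.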
Edit: in response to your input structure and expected output, I suggest the following:
output = df %>%
  group_by(Year) %>%
  mutate(threshold = quantile(Flowers, 0.7)) %>%
  mutate(new_group = ifelse(Flowers < threshold, "Others", SciName)) %>%
  group_by(Year, new_group) %>%
  summarise(Flowers = sum(Flowers)) %>%
  select(Year, SciName = new_group, Flowers)
Because I am not using row numbers here, I am assuming that SciName identities are unique.
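As a way to try this out, here is a small sketch that runs the same steps on only the eight rows visible in the question (the real data is larger, so results on the full data will differ). The helper name `lump_below` and its argument `p` are hypothetical, introduced here for illustration. With R's default quantile type, these eight rows reproduce the question's expected output at `p = 0.5`; at `p = 0.7` the cutoff for four values lands between the two largest, so only the top species per year survives.

```r
library(dplyr)

# Only the eight rows shown in the question; the full data frame has more
df <- data.frame(
  Year    = rep(c(2004L, 2005L), each = 4),
  SciName = rep(c("Liliac", "Rose", "Daisy", "Lily"), times = 2),
  Flowers = c(2000, 3000, 10, 5, 20, 3, 1000, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each Year, relabel rows below the p-th
# quantile of Flowers as "Others", then sum Flowers per label
lump_below <- function(data, p) {
  data %>%
    group_by(Year) %>%
    mutate(SciName = ifelse(Flowers < quantile(Flowers, p, names = FALSE),
                            "Others", SciName)) %>%
    group_by(Year, SciName) %>%
    summarise(Flowers = sum(Flowers), .groups = "drop")
}

result_50 <- lump_below(df, 0.5)  # two species kept per year on this sample
result_70 <- lump_below(df, 0.7)  # one species kept per year on this sample

print(result_50)
print(result_70)
```

Taking the cutoff as a parameter makes it easy to compare percentiles before settling on one for the full data.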