Find the top X percentile of values and combine all values below that percentile into an "Others" row for each group in R

Question

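I am new to coding in R and am having some trouble. For each group, I am trying to find the values in a column that are above a certain percentile (X%ile) and then combine all rows below that percentile into a single "Others" row per group.

My situation is very similar to this question: How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

I am grouping by two columns and trying to add rows in the second and third columns: the name "Others" goes in the second column, and the sum of all values below the percentile goes in the third column of that same row.

I am working with a large dataframe (df) that has the following columns: Year (class = integer, i.e. 2007, 2008, ...), SciName (class = character), and Flowers (class = numeric).

I was able to filter and show only the rows with values above a certain percentile using: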

df_filter <- df %>%
  filter(Flowers > quantile(Flowers, 0.7))
view(df_filter)

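However, I have not been able to find a way to add the "Others" rows that I need.

Based on the accepted answer to the similar question linked above, I tried: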

df_Others <- df %>%
  ungroup() %>%
  group_by(Year) %>%
  arrange(desc(Flowers)) %>%
  mutate(a = row_number(-Flowers)) %>%
  mutate(SciName = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(SciName))) %>%
  mutate(a = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(a))) %>%
  group_by(Year, SciName, a) %>%
  summarize(Flowers = sum(Flowers)) %>%
  arrange(Year, a) %>%
  select(-a)

View(df_Others)

...but this does not work.

Any suggestions on how to do this would be greatly appreciated!

Edit:

Input:

Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Daisy      10
2004    Lily       5
2005    Liliac     20
2005    Rose       3
2005    Daisy      1000
2005    Lily       5000
...     ...        ...
 

Expected Output:
Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Others     15
2005    Daisy      1000
2005    Lily       5000
2005    Others     23
...     ...        ...

Answer 1

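Without your input data and expected output, it is hard to determine why your current approach is not working. I would recommend something like the following: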

library(dplyr)
# access iris, a built-in dataset
data(iris)

df = iris %>%
  # all the groups that you want to work within should be listed here
  group_by(Species) %>%
  # this creates a new column containing the threshold for each group
  # replace Sepal.Length with the numeric column you want the percentile of
  mutate(threshold = quantile(Sepal.Length, 0.7)) %>%
  # in case of duplicate values in Sepal.Length
  # creating row-numbers to avoid merging them
  mutate(rn = row_number()) %>%
  # setup variable that contains the desired output groups
  mutate(new_group = ifelse(Sepal.Length < threshold, "other", rn))

# pause here to inspect and confirm the new_group column gives the preferred groups

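df = df %>%
  group_by(Species, new_group) %>%
  # summarise values as desired; Sepal.Length has to be aggregated here
  # (sum() is one choice), otherwise the column is gone after summarise()
  summarise(Sepal.Length = sum(Sepal.Length), num = n(), .groups = "drop") %>%
  select(Species, new_group, Sepal.Length, num)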
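
For the "pause and inspect" step, it can help to look at the per-group cut-offs directly. The following is only a sketch on the built-in iris data; the names thresholds, n_below and n_total are made up for illustration and are not part of the code above.

```r
library(dplyr)

# One row per Species: the 70th-percentile Sepal.Length for that group,
# how many rows fall strictly below it, and the group size.
# (thresholds, n_below and n_total are illustrative names only.)
thresholds <- iris %>%
  group_by(Species) %>%
  summarise(
    threshold = quantile(Sepal.Length, 0.7),
    n_below   = sum(Sepal.Length < threshold),
    n_total   = n()
  )

print(thresholds)
```

If n_below is far from the share of rows you expected to lump together, ties or a small group size are the likely cause, since quantile() interpolates between observed values by default.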

Edit: in response to your input structure and expected output, I suggest the following:

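output = df %>%
  group_by(Year) %>%
  # per-year threshold at the 70th percentile of Flowers
  mutate(threshold = quantile(Flowers, 0.7)) %>%
  # rows below the threshold are relabelled "Others"
  mutate(new_group = ifelse(Flowers < threshold, "Others", SciName)) %>%
  group_by(Year, new_group) %>%
  # collapse each label to one row; .groups = "drop" returns an ungrouped result
  summarise(Flowers = sum(Flowers), .groups = "drop") %>%
  select(Year, SciName = new_group, Flowers)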

Because I am not using row numbers here, I am assuming that the SciName values are unique within each group.
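
As a rough check, the same steps can be run on the eight sample rows from the question. This is a sketch only: lump_below_quantile is a hypothetical helper name introduced here, and the remaining rows of the real data (the "..." in the question) are unknown, so the cut-offs on the full dataset will differ.

```r
library(dplyr)

# The eight sample rows shown in the question's input table
df <- data.frame(
  Year    = rep(c(2004L, 2005L), each = 4),
  SciName = rep(c("Liliac", "Rose", "Daisy", "Lily"), times = 2),
  Flowers = c(2000, 3000, 10, 5, 20, 3, 1000, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the original answer): within each Year,
# relabel rows below the p-th quantile of Flowers as "Others" and sum them.
lump_below_quantile <- function(data, p) {
  data %>%
    group_by(Year) %>%
    mutate(SciName = ifelse(Flowers < quantile(Flowers, p), "Others", SciName)) %>%
    group_by(Year, SciName) %>%
    summarise(Flowers = sum(Flowers), .groups = "drop")
}

res_70 <- lump_below_quantile(df, 0.7)
res_50 <- lump_below_quantile(df, 0.5)

print(res_70)
print(res_50)
```

With only four rows per year and quantile()'s default interpolation, the 0.7 cut-offs are 2100 for 2004 and 1400 for 2005, so res_70 keeps just Rose and Lily and lumps everything else. On this sample, p = 0.5 yields the same rows as the expected output in the question (2004: Liliac, Rose, Others = 15; 2005: Daisy, Lily, Others = 23), although summarise() sorts them by name rather than placing "Others" last.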

Solution 1
0 Accepted 2022-06-15 21:09:38
