How to randomly remove rows in dataframe but for a specific subgroup only (with dplyr::sample_n?)

In a specific column, I have several categories. I want to randomly thin out (remove) some rows in only one category. I've seen sample_n used with group_by, but its size argument keeps the same number of rows in every group of the grouping variable. I want to specify a different size for each group.

Second, I want to do this "in place": the result should be the same original data frame, except that it now has fewer rows in the specific category I chose to thin.

Example Data

library(tidyverse)

set.seed(123)

df <-
  tibble(
    color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
    value = sample(0:750, size = 1000, replace = TRUE)
  )

df

## # A tibble: 1,000 x 2
##    color  value
##    <chr>  <int>
##  1 yellow   251
##  2 yellow   389
##  3 blue     742
##  4 blue     227
##  5 yellow   505
##  6 brown     47
##  7 green    381
##  8 red      667
##  9 blue     195
## 10 yellow   680
## # ... with 990 more rows

Tallying by color gives:

df %>% count(color)

  color      n
  <chr>  <int>
1 blue     204
2 brown    202
3 green    191
4 red      203
5 yellow   200

Now let's say I want to reduce the number of rows only for the red color, keeping just 10 rows where color == "red". Simply using sample_n doesn't get me there, obviously:

df %>%
  group_by(color) %>%
  sample_n(10) %>%
  count(color)

  color      n
  <chr>  <int>
1 blue      10
2 brown     10
3 green     10
4 red       10
5 yellow    10

How can I specify that only color == "red" will have 10 rows while the other colors remain untouched?

I've seen some similar questions (like this one), but wasn't able to adapt the answers to my case.

We can write a function that filters the specified colors, samples n rows from each of them, and binds the result to the remaining rows of the original data:

library(dplyr)

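sample_for_color <- function(data, col_to_change, n) {
  data %>%
    # keep only the colors to be thinned
    filter(color %in% col_to_change) %>%
    # sample n rows within each of those colors
    group_by(color) %>%
    slice_sample(n = n) %>%
    ungroup() %>%
    # append the untouched rows of every other color
    bind_rows(data %>% filter(!color %in% col_to_change))
}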

new_df <- df %>% sample_for_color('red', 10)
new_df %>% count(color)

#  color      n
#  <chr>  <int>
#1 blue     204
#2 brown    202
#3 green    191
#4 red       10
#5 yellow   200
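
Note that bind_rows puts the sampled rows first, so the original row order is not preserved. If the order matters (the "in place" part of the question), one possible variant is sketched below. The function name thin_keep_order, the argument name size, and the demo data frame are made up for this illustration; like the function above, it assumes the grouping column is called color.

```r
library(dplyr)

# Hypothetical helper: thin the given colors to at most `size` rows each,
# leaving all other rows, and the overall row order, untouched.
thin_keep_order <- function(data, col_to_change, size) {
  data %>%
    group_by(color) %>%
    # Within each group, row_number() runs from 1 to n().
    # Rows of non-target colors always pass the first condition;
    # rows of target colors pass only if their position was drawn.
    filter(
      !(color %in% col_to_change) |
        row_number() %in% sample.int(n(), min(size, n()))
    ) %>%
    ungroup()
}

# Small demo data: 100 rows per color, interleaved
demo <- data.frame(
  color = rep(c("red", "blue", "green"), length.out = 300),
  value = seq_len(300)
)

thinned <- thin_keep_order(demo, "red", 10)
thinned %>% count(color)
```

Because filter only drops rows, the surviving rows stay in the order they had in the input.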

new_df <- df %>% sample_for_color(c('red', 'blue'), 10)
new_df %>% count(color)

#  color      n
#  <chr>  <int>
#1 blue      10
#2 brown    202
#3 green    191
#4 red       10
#5 yellow   200
