In a specific column, I have several categories. I want to randomly thin out (remove) some rows in only one category. I've seen sample_n used with group_by, but its size argument keeps the same number of rows for every category in the grouped variable. I want to specify a different size for each group.
Second, I'm looking to do it "in place", meaning that I want it to return the same original dataframe, just that now it will have fewer rows in the specific category I sought to "dilute".
library(tidyverse)
set.seed(123)
df <-
  tibble(
    color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
    value = sample(0:750, size = 1000, replace = TRUE)
  )
df
## # A tibble: 1,000 x 2
## color value
## <chr> <int>
## 1 yellow 251
## 2 yellow 389
## 3 blue 742
## 4 blue 227
## 5 yellow 505
## 6 brown 47
## 7 green 381
## 8 red 667
## 9 blue 195
## 10 yellow 680
## # ... with 990 more rows
When I tally by color, I see:
df %>% count(color)
color n
<chr> <int>
1 blue 204
2 brown 202
3 green 191
4 red 203
5 yellow 200
Now let's say that I want to decrease the number of rows only for the red color, keeping just 10 rows where color == "red". Simply using sample_n doesn't get me there, obviously:
df %>%
  group_by(color) %>%
  sample_n(10) %>%
  count(color)
color n
<chr> <int>
1 blue 10
2 brown 10
3 green 10
4 red 10
5 yellow 10
How can I specify that only color == "red" will have 10 rows while the other colors remain untouched?
I've seen some similar questions (like this one), but wasn't able to adapt the answers to my case.
We can write a function that filters the specified colors, samples n rows from each of them, and binds the result to the remaining rows of the original data:
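library(dplyr)
sample_for_color <- function(data, col_to_change, n) {
  data %>%
    # keep only the colors to be thinned
    filter(color %in% col_to_change) %>%
    group_by(color) %>%
    # draw n random rows from each of those colors
    slice_sample(n = n) %>%
    ungroup() %>%
    # append the untouched rows of all other colors
    bind_rows(data %>% filter(!color %in% col_to_change))
}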
new_df <- df %>% sample_for_color('red', 10)
new_df %>% count(color)
# color n
# <chr> <int>
#1 blue 204
#2 brown 202
#3 green 191
#4 red 10
#5 yellow 200
new_df <- df %>% sample_for_color(c('red', 'blue'), 10)
new_df %>% count(color)
# color n
# <chr> <int>
#1 blue 10
#2 brown 202
#3 green 191
#4 red 10
#5 yellow 200
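The function above applies the same n to every listed color and moves the sampled rows to the top of the result. Since the question also asks for a different size per group and for the rest of the data frame to stay as it was, here is a minimal sketch of one possible variant. The helper name thin_groups and its sizes argument (a named vector mapping color to the number of rows to keep) are made up for illustration and are not part of dplyr. It relies on a grouped filter(), which keeps the surviving rows in their original order.

```r
library(dplyr)

set.seed(123)
df <- tibble(
  color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
  value = sample(0:750, size = 1000, replace = TRUE)
)

# Hypothetical helper (not a dplyr function).
# `sizes` is a named vector such as c(red = 10, blue = 50);
# colors that are not named in `sizes` are left untouched.
thin_groups <- function(data, sizes) {
  data %>%
    group_by(color) %>%
    filter({
      # target size for the current group, NA if the color is not listed
      target <- unname(sizes[color[1]])
      if (is.na(target)) {
        rep(TRUE, n())  # keep every row of an unlisted color
      } else {
        # keep a random subset of row positions within this group
        row_number() %in% sample.int(n(), min(n(), target))
      }
    }) %>%
    ungroup()
}

new_df <- df %>% thin_groups(c(red = 10, blue = 50))
new_df %>% count(color)
```

Because rows are only filtered out and never rebound, the columns and the relative row order of df are preserved. If a requested size exceeds the group size, min(n(), target) simply keeps the whole group.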