Randomly remove rows based on a condition for one column in R
I need to balance my training set for a machine learning task where two categories are unevenly represented in a df. I need an equal number of rows in each category, so I need to remove rows at random from the larger one:
library(tidyverse)
set.seed(123)
mydata <-
  tibble(
    prod = sample(c("durum", "bread"), size = 1000, replace = TRUE),
    value = sample(0:20, size = 1000, replace = TRUE)
  )
prod_rows <- mydata %>% count(prod)
prod_rows
# A tibble: 2 x 2
#   prod      n
# * <chr> <int>
# 1 bread   494
# 2 durum   506
So I tried:
mydata_new <- mydata[- sample(1:nrow(mydata), abs(prod_rows$n[1] - prod_rows$n[2])), ]
This works, but it removes rows from both categories. I want to remove rows only where mydata$prod == "durum", that is, only from the larger category.
The answer to this question does most of what I would like to achieve. However, I need to retain the order of the rows as in the original df, so I can't separate the categories and then use bind_rows() to put them back together.
You can create a new column to maintain the original order of rows.
library(dplyr)
# One-row summary: the larger category and the size it should be reduced to
keep_df <- prod_rows %>%
  summarise(prod = prod[which.max(n)],
            n = min(n))
keep_df
# A tibble: 1 x 2
# prod n
# <chr> <int>
#1 durum 494
mydata <- mydata %>% mutate(original_order = row_number())
# A tibble: 1,000 x 3
# prod value original_order
# <chr> <int> <int>
# 1 durum 12 1
# 2 durum 7 2
# 3 durum 1 3
# 4 bread 5 4
# 5 durum 16 5
# 6 bread 13 6
# 7 bread 6 7
# 8 bread 12 8
# 9 durum 7 9
#10 durum 5 10
# … with 990 more rows
Remove the extra rows from the larger category, bind the untouched rows back, and arrange() the data according to the original order.
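mydata %>%
  inner_join(keep_df, by = 'prod') %>%                      # rows of the larger category only
  sample_n(first(n)) %>%                                    # keep as many as the smaller category has
  bind_rows(mydata %>% anti_join(keep_df, by = 'prod')) %>% # add back the untouched rows
  arrange(original_order) %>%                               # restore the original row order
  select(-original_order, -n) -> result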
result
# A tibble: 988 x 2
# prod value
# <chr> <int>
# 1 durum 12
# 2 durum 7
# 3 durum 1
# 4 bread 5
# 5 bread 13
# 6 bread 6
# 7 bread 12
# 8 durum 7
# 9 durum 5
#10 bread 10
# … with 978 more rows
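As a side note, sample_n() has been superseded by slice_sample() since dplyr 1.0.0. The same idea can be sketched with a grouped slice_sample(), which also extends to more than two categories. This is only a sketch, assuming dplyr >= 1.0.0; the names min_n and result_alt are made up for illustration, and the sampled rows will differ from those in result above.

```r
library(dplyr)

set.seed(123)
# Same toy data as in the question
mydata <- tibble(
  prod  = sample(c("durum", "bread"), size = 1000, replace = TRUE),
  value = sample(0:20, size = 1000, replace = TRUE)
)

# Size of the smallest category; every category is sampled down to this
# (min_n is a hypothetical helper name)
min_n <- min(table(mydata$prod))

result_alt <- mydata %>%
  mutate(original_order = row_number()) %>%  # remember each row's position
  group_by(prod) %>%
  slice_sample(n = min_n) %>%                # random rows within each category
  ungroup() %>%
  arrange(original_order) %>%                # restore the original row order
  select(-original_order)

count(result_alt, prod)
```

Categories that are already at the minimum size keep all of their rows, so no join against a summary table is needed.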
Check result:
result %>% count(prod)
# A tibble: 2 x 2
# prod n
# <chr> <int>
#1 bread 494
#2 durum 494
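If the helper column and the joins feel heavy, a base R variant of the question's own attempt may be enough: restrict the sampling to the row positions of the larger category and drop those rows with negative indexing, which leaves the remaining rows in their original order. This is a minimal sketch, assuming exactly two categories; it uses a plain data.frame so it runs without any packages, and the names larger, n_drop, larger_rows, drop_rows and balanced are made up for illustration.

```r
set.seed(123)
# Same toy data as in the question, as a plain data.frame
mydata <- data.frame(
  prod  = sample(c("durum", "bread"), size = 1000, replace = TRUE),
  value = sample(0:20, size = 1000, replace = TRUE)
)

counts <- table(mydata$prod)
larger <- names(counts)[which.max(counts)]  # over-represented category
n_drop <- max(counts) - min(counts)         # number of rows to remove

# Row positions of the larger category; sample.int() picks among those
# positions and avoids the sample(x) surprise when x has length 1
larger_rows <- which(mydata$prod == larger)
drop_rows   <- larger_rows[sample.int(length(larger_rows), n_drop)]

# Negative indexing keeps the remaining rows in their original order.
# The guard avoids mydata[-integer(0), ], which would return zero rows.
balanced <- if (n_drop > 0) mydata[-drop_rows, ] else mydata

table(balanced$prod)
```

The same indexing should work on a tibble as well, since only the rows listed in drop_rows are removed and nothing is reordered.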