
Randomly remove rows based on a condition for one column in R

I need to balance my training set for a machine learning task where two categories are unevenly represented in a data frame. I need an equal number of rows per category, so I need to randomly remove rows from the larger category:

library(tidyverse)
set.seed(123)

mydata <- 
  tibble(
    prod = sample(c("durum", "bread"), size = 1000, replace = TRUE),
    value = sample(0:20, size = 1000, replace = TRUE)
  )

prod_rows <- mydata %>% count(prod)
prod_rows

# A tibble: 2 x 2
  prod      n
* <chr> <int>
1 bread   494
2 durum   506

So I tried:

mydata_new <- mydata[- sample(1:nrow(mydata), abs(prod_rows$n[1] - prod_rows$n[2])), ]

This works, but it removes rows from both categories. I want to remove rows only where mydata$prod == "durum", that is, only from the larger category.

The answer to this question does most of what I would like to achieve. However, I need to retain the order of the rows as in the original df, so I can't separate the categories and then use bind_rows() to recombine them.

You can create a new column to maintain the original order of rows. First work out which category is larger and how many of its rows to keep, then record each row's original position.

library(dplyr)

keep_df <- prod_rows %>%
  summarise(prod = prod[which.max(n)], 
            n = min(n))
keep_df

# A tibble: 1 x 2
#  prod      n
#  <chr> <int>
#1 durum   494

mydata <- mydata %>% mutate(original_order = row_number())
mydata

# A tibble: 1,000 x 3
#   prod  value original_order
#   <chr> <int>          <int>
# 1 durum    12              1
# 2 durum     7              2
# 3 durum     1              3
# 4 bread     5              4
# 5 durum    16              5
# 6 bread    13              6
# 7 bread     6              7
# 8 bread    12              8
# 9 durum     7              9
#10 durum     5             10
# … with 990 more rows

Randomly sample the larger category down to the size of the smaller one, bind the untouched rows of the smaller category back on, and arrange the data by the original order.

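result <- mydata %>%
  # keep only rows of the larger category (adds column n = rows to keep)
  inner_join(keep_df, by = 'prod') %>%
  # randomly keep n of them
  sample_n(first(n)) %>%
  # add back the rows of the smaller category
  bind_rows(mydata %>% anti_join(keep_df, by = 'prod')) %>%
  # restore the original row order and drop the helper columns
  arrange(original_order) %>%
  select(-original_order, -n)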

result

# A tibble: 988 x 2
#   prod  value
#   <chr> <int>
# 1 durum    12
# 2 durum     7
# 3 durum     1
# 4 bread     5
# 5 bread    13
# 6 bread     6
# 7 bread    12
# 8 durum     7
# 9 durum     5
#10 bread    10
# … with 978 more rows
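
As a side note, the same idea can probably be written more compactly with grouped sampling. The following is only a sketch, not part of the original answer: it assumes dplyr 1.0.0 or later (for slice_sample()), and the names target_n and result_alt are made up for illustration. It also cuts every category down to the size of the smallest one, so it should carry over to more than two categories.

```r
library(dplyr)
set.seed(123)

# Same toy data as in the question
mydata <- tibble(
  prod  = sample(c("durum", "bread"), size = 1000, replace = TRUE),
  value = sample(0:20, size = 1000, replace = TRUE)
)

# Size of the smallest category; every group is cut down to this many rows
# (target_n is a hypothetical name)
target_n <- min(count(mydata, prod)$n)

result_alt <- mydata %>%
  mutate(original_order = row_number()) %>%  # remember the row positions
  group_by(prod) %>%
  slice_sample(n = target_n) %>%             # random subset within each group
  ungroup() %>%
  arrange(original_order) %>%                # restore the original order
  select(-original_order)

count(result_alt, prod)
```

Groups that already have exactly target_n rows keep all of them; slice_sample() only shuffles them, and arrange() puts them back in place.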

Check the result:

result %>% count(prod)

# A tibble: 2 x 2
#  prod      n
#  <chr> <int>
#1 bread   494
#2 durum   494
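
If you would rather stay close to the negative-indexing attempt in the question, a base R sketch along the following lines should also work. It is an alternative, not the approach above: it samples the surplus row positions from the larger category only and drops them, so the remaining rows keep their original order without a helper column. The variable names (counts, larger, n_drop, candidates, drop_rows, balanced) are illustrative, and the toy data is rebuilt as a plain data.frame so the example is self-contained.

```r
set.seed(123)

# Same toy data as in the question, built with base R only
mydata <- data.frame(
  prod  = sample(c("durum", "bread"), size = 1000, replace = TRUE),
  value = sample(0:20, size = 1000, replace = TRUE)
)

counts <- table(mydata$prod)
larger <- names(counts)[which.max(counts)]  # category with more rows
n_drop <- max(counts) - min(counts)         # surplus rows to remove

# Row positions of the larger category; draw the surplus from these only.
# sample.int() on the length avoids sample()'s special case for a
# single-number input.
candidates <- which(mydata$prod == larger)
drop_rows  <- candidates[sample.int(length(candidates), n_drop)]

# Negative indexing removes rows and leaves the rest in their original order.
# Guard against n_drop == 0, because mydata[-integer(0), ] returns no rows.
balanced <- if (n_drop > 0) mydata[-drop_rows, ] else mydata

table(balanced$prod)
```

The row names of balanced still hold the original row numbers, which gives a quick way to confirm that the order was preserved.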
