
Replace values in each column based on conditions according to groups (by rows) data.frame

I have a data.frame with 400 rows and 15000 columns. For the rows belonging to each group, defined by df$Group, I want to check, column by column, whether the group has non-zero values in at least 50% of its rows. If yes, keep the existing values; otherwise replace all of that group's values in the column with 0.

For example, for group a in the first column, df[1:6,1]: if sum(df[1:6,1] == 0)/length(df[1:6,1]) > 0.5, then all values in df[1:6,1] will be replaced with 0. Otherwise the existing values remain.
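To make the rule concrete, here is a minimal base R sketch that works it through for column r3 of the sample data. The vectors are typed in by hand from that table, and the names `zero_frac` and `to_zero` are only illustrative. Note that group a sits exactly at 50% zeros, which is not above the threshold, so it is kept.

```r
# Column r3 and the grouping, copied from the sample data in the question
r3    <- c(0, 67, 30, 50, 0, 0,   0, 0, 0, 0, 0, 0,   280, 0, 0, 0, 49)
group <- rep(c("a", "b", "c"), times = c(6, 6, 5))

# Share of zeros per group: the mean of a logical vector is the fraction of TRUEs
zero_frac <- tapply(r3 == 0, group, mean)

# Only groups strictly above 50% zeros are wiped
to_zero <- names(zero_frac)[zero_frac > 0.5]
r3[group %in% to_zero] <- 0
```

Here `zero_frac` is 0.5 for a, 1 for b and 0.6 for c, so b and c are set to 0 and a keeps its values.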

Sample input:

df <- read.table(text= "DATA  r1    r2  r3  Group
a1  6835    256 0   a
a2  5395    0   67  a
a3  7746    0   30  a
a4  7496    556 50  a
a5  5780    255 0   a
a6  6060    603 0   a
b1  0   0   0   b
b2  0   258 0   b
b3  0   0   0   b
b4  0   0   0   b
b5  5099    505 0   b
b6  0   680 0   b
c1  8443    4900    280 c
c2  8980    4949    0   c
c3  7828    0   0   c
c4  6509    3257    0   c
c5  6563    0   49  c
", header=TRUE, na.strings="NA", row.names=1)
library(data.table)
dt <- as.data.table(df) # or keep it as a data.frame

Expected output:

>df
DATA   r1     r2    r3  Group
 a1   6835   256    0     a
 a2   5395     0   67     a
 a3   7746     0   30     a
 a4   7496   556   50     a
 a5   5780   255    0     a
 a6   6060   603    0     a
 b1      0     0    0     b
 b2      0   258    0     b
 b3      0     0    0     b
 b4      0     0    0     b
 b5      0   505    0     b
 b6      0   680    0     b
 c1   8443  4900    0     c
 c2   8980  4949    0     c
 c3   7828     0    0     c
 c4   6509  3257    0     c
 c5   6563     0    0     c

Update: This bug, #4957, is now fixed in v1.8.11. From NEWS:

Fixing #5007 also fixes #4957, where .N was not visible during lapply(.SD, function(x) ...) in j. Thanks to juba for noticing it here on SO: Replace values in each column based on conditions according to groups (by rows) data.frame


Here is a way with data.table:

dt[, lapply(.SD, function(v) {
    len <- length(v)
    if((sum(v==0)/len)>0.5) rep(0L,len) else v
}), by="Group", .SDcols=c("r1","r2","r3")]

Which gives:

   Group   r1   r2 r3
 1:     a 6835  256  0
 2:     a 5395    0 67
 3:     a 7746    0 30
 4:     a 7496  556 50
 5:     a 5780  255  0
 6:     a 6060  603  0
 7:     b    0    0  0
 8:     b    0  258  0
 9:     b    0    0  0
10:     b    0    0  0
11:     b    0  505  0
12:     b    0  680  0
13:     c 8443 4900  0
14:     c 8980 4949  0
15:     c 7828    0  0
16:     c 6509 3257  0
17:     c 6563    0  0
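With 15000 columns it may be worth updating the table by reference instead of building a new one. The following is a sketch of that variant using data.table's `:=` operator. The toy table and the helper name `zero_if_sparse` are made up for illustration, and it assumes integer columns with no NA values.

```r
library(data.table)

# Small made-up table in the same shape as the question's data
dt <- data.table(
  Group = rep(c("a", "b"), each = 4),
  r1    = c(5L, 0L, 7L, 9L,   0L, 0L, 0L, 3L),
  r2    = c(0L, 0L, 1L, 2L,   4L, 0L, 6L, 0L)
)
cols <- c("r1", "r2")

# Hypothetical helper: return all zeros if more than half of v is zero
zero_if_sparse <- function(v) {
  if (mean(v == 0) > 0.5) rep(0L, length(v)) else v
}

# (cols) := ... overwrites the listed columns in place, one group at a time
dt[, (cols) := lapply(.SD, zero_if_sparse), by = Group, .SDcols = cols]
```

Only r1 in group b (three zeros out of four) is wiped. The r2 values sit at exactly 50% zeros in both groups and are left alone.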

Quick and dirty:

ff<-function(x){
  if(is.numeric(x)){
    b<-by(x==0,df$Group,mean)
    x[df$Group %in% names(b)[b>0.5]]<-0 
  }
  x
}

data.frame(lapply(df,ff))
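The function above reads `df` from the global environment. A possible variation passes the grouping vector explicitly and lets base R's `ave()` do the per-group work. The helper name `zero_sparse_groups`, its `threshold` argument, and the toy data frame are assumptions for illustration, and the sketch assumes there are no NA values.

```r
# Hypothetical helper: zero out each group of x whose share of zeros exceeds `threshold`
zero_sparse_groups <- function(x, group, threshold = 0.5) {
  ave(x, group, FUN = function(v) {
    if (mean(v == 0) > threshold) v[] <- 0L  # v[] keeps the type of v
    v
  })
}

# Small made-up data frame in the same shape as the question's data
df <- data.frame(
  r1    = c(5L, 0L, 7L, 9L,   0L, 0L, 0L, 3L),
  Group = rep(c("a", "b"), each = 4)
)

# Apply to numeric columns only, leaving Group and row names untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], zero_sparse_groups, group = df$Group)
```

Assigning back into `df[num_cols]` keeps the row names and the non-numeric columns, which `data.frame(lapply(df, ff))` rebuilds from scratch.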
