Data.Table Manipulate Many Columns Conditionally

data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
                Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
                Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
                A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
                B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
                C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))


##>data1
##    Year Group Class    A    B    C
## 1  2010     1     A 0.73 0.12 0.43
## 2  2010     1     B 0.55 0.08 0.51
## 3  2010     1     C 0.54 0.14 0.29
## 4  2011     1     A 0.49 0.21 0.60
## 5  2011     1     B 0.52 0.33 0.28
## 6  2011     1     C 0.49 0.98 0.97
## 7  2010     2     A 0.26 0.33 0.78
## 8  2010     2     B 0.55 0.99 0.84
## 9  2010     2     C 0.39 0.02 0.34
## 10 2011     2     A 0.34 0.59 0.82
## 11 2011     2     B 0.84 0.27 0.75
## 12 2011     2     C 0.29 0.72 0.97  

I have 'data1' and wish to make 'data2'. 'data2' will have exactly the same dimensions as 'data1', but with the following conditions applied:

IF Class = 'A', then Column 'B' = (1-B)*0.05, Column 'C' = (1-C)*0.05, and after updating Column 'B' and Column 'C', we calculate Column 'A' = 1 - (B+C).

IF Class = 'B', then Column 'A' = (1-A)*0.05, Column 'C' = (1-C)*0.05, and after updating Column 'A' and Column 'C', we calculate Column 'B' = 1 - (A+C).

IF Class = 'C', then Column 'A' = (1-A)*0.05, Column 'B' = (1-B)*0.05, and after updating Column 'A' and Column 'B', we calculate Column 'C' = 1 - (A+B).
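To make the rules concrete, here is the arithmetic for the first row of 'data1' (Class 'A', A = 0.73, B = 0.12, C = 0.43). The names `B_new`, `C_new` and `A_new` are only for illustration and are not columns in the data.

```r
# First row of data1: Class 'A', so B and C are rescaled first
B_new <- (1 - 0.12) * 0.05      # 0.044
C_new <- (1 - 0.43) * 0.05      # 0.0285

# Column A is then computed from the updated B and C
A_new <- 1 - (B_new + C_new)    # 0.9275

c(A = A_new, B = B_new, C = C_new)
```

Because the Class column is always set to 1 minus the sum of the others, every row of 'data2' sums to 1.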

I am hoping for an efficient data.table solution, since I have a very large dataset with many more than 3 'Classes'.


Here is a slow solution that makes the desired updates.

library(data.table)
setDT(data1)

data1[, newB := fifelse(Class == 'A', (1-B) * 0.05, NA_real_)]
data1[, newC := fifelse(Class == 'A', (1-C) * 0.05, NA_real_)]
data1[, newA := fifelse(Class == 'A', (1-(newB+newC)), NA_real_)]

data1[, newA := fifelse(Class == 'B', (1-A) * 0.05, newA)]
data1[, newC := fifelse(Class == 'B', (1-C) * 0.05, newC)]
data1[, newB := fifelse(Class == 'B', (1-(newA+newC)), newB)]

data1[, newA := fifelse(Class == 'C', (1-A) * 0.05, newA)]
data1[, newB := fifelse(Class == 'C', (1-B) * 0.05, newB)]
data1[, newC := fifelse(Class == 'C', (1-(newA+newB)), newC)]

I suggest the following:

# Setting the dataframe as a data.table
data1 <- data.table::setDT(data1)
head(data1)
##    Year Group Class    A    B    C
## 1: 2010     1     A 0.73 0.12 0.43
## 2: 2010     1     B 0.55 0.08 0.51
## 3: 2010     1     C 0.54 0.14 0.29
## 4: 2011     1     A 0.49 0.21 0.60
## 5: 2011     1     B 0.52 0.33 0.28
## 6: 2011     1     C 0.49 0.98 0.97
# First I copy this data.table
data2 = data.table::copy(data1)
# I store the variable names that I will change
list_of_var = setdiff(colnames(data1), c("Year", "Group", "Class"))
# In data1 I change by reference all these variable with the
# transformation (1-X)*0.05
data1[, (list_of_var) := lapply(.SD, function(x) (1-x)*0.05),.SDcols = list_of_var]

# Then for each of my variables
for (variable in list_of_var){
   # I store the names of the other variables

  cols <- setdiff(colnames(data1), c(variable, "Year", "Group", "Class"))
  # and apply the transformation conditionally on value of Class
  for (var in cols){
    data2[Class == variable, (var) := data1[Class == variable, var, with = FALSE]]
  }
}

# After doing this I will now apply the 1-B-C transformation for A conditionally
# on Class, and same for each variable
for (variable in list_of_var){
  other_vars = setdiff(list_of_var, variable)
  new_var = rowSums(data2[Class == variable, ..other_vars])
  data2[Class == variable, (variable) := 1 - new_var]

}
head(data2)

This is now the result:

   Year Group Class      A      B      C
1: 2010     1     A 0.9275 0.0440 0.0285
2: 2010     1     B 0.0225 0.9530 0.0245
3: 2010     1     C 0.0230 0.0430 0.9340
4: 2011     1     A 0.9405 0.0395 0.0200
5: 2011     1     B 0.0240 0.9400 0.0360
6: 2011     1     C 0.0255 0.0010 0.9735
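The two loops above can probably be condensed into one pass per class. The following is a minimal sketch, not a benchmarked solution: it assumes, as the code above does, that every value of Class is also the name of a value column. The names `value_cols`, `cls` and `others` are illustrative.

```r
library(data.table)

# Same sample data as in the question
data1 <- data.table(
  Year  = rep(c(2010, 2011), each = 3, times = 2),
  Group = rep(c(1, 2), each = 6),
  Class = rep(c("A", "B", "C"), times = 4),
  A = c(0.73, 0.55, 0.54, 0.49, 0.52, 0.49, 0.26, 0.55, 0.39, 0.34, 0.84, 0.29),
  B = c(0.12, 0.08, 0.14, 0.21, 0.33, 0.98, 0.33, 0.99, 0.02, 0.59, 0.27, 0.72),
  C = c(0.43, 0.51, 0.29, 0.60, 0.28, 0.97, 0.78, 0.84, 0.34, 0.82, 0.75, 0.97)
)

# Columns to transform: everything except the identifier columns
value_cols <- setdiff(names(data1), c("Year", "Group", "Class"))

# Work on a copy so data1 keeps the original values
data2 <- copy(data1)

# Step 1: apply (1 - x) * 0.05 to every value column
data2[, (value_cols) := lapply(.SD, function(x) (1 - x) * 0.05),
      .SDcols = value_cols]

# Step 2: for the rows of each class, overwrite the column named after the
# class with 1 - (sum of the other, already transformed, columns)
for (cls in value_cols) {
  others <- setdiff(value_cols, cls)
  data2[Class == cls, (cls) := 1 - rowSums(.SD), .SDcols = others]
}

head(data2)
```

Step 1 transforms the class's own column as well, but step 2 overwrites it, so the result matches the rules in the question. The loop runs once per class rather than once per pair of columns.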

Edit: OK, this has been bothering me all day. How about this:

data1[,.(Year, Group,
         A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
         B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
         C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),
         by=Class]
    Class Year Group      A      B      C
 1:     A 2010     1 0.9275 0.0440 0.0285
 2:     A 2011     1 0.9405 0.0395 0.0200
 3:     A 2010     2 0.9555 0.0335 0.0110
 4:     A 2011     2 0.9705 0.0205 0.0090
 5:     B 2010     1 0.0225 0.9530 0.0245
 6:     B 2011     1 0.0240 0.9400 0.0360
 7:     B 2010     2 0.0225 0.9695 0.0080
 8:     B 2011     2 0.0080 0.9795 0.0125
 9:     C 2010     1 0.0230 0.0430 0.9340
10:     C 2011     1 0.0255 0.0010 0.9735
11:     C 2010     2 0.0305 0.0490 0.9205
12:     C 2011     2 0.0355 0.0140 0.9505

And it runs on 10,000,000 rows in under a second at the median (about 0.8 s in the benchmark below, with a worst case of about 2.1 s).

data1 <- data.table(Year = rep(2011:2020,each=1000000),Group = rep(1:10,times=1000000),Class = LETTERS[1:3], A = runif(1000000,0,1),B = runif(1000000,0,1),C = runif(1000000,0,1))
data1
          Year Group Class            A            B             C
       1: 2011     1     A 0.2890449290 0.6917136966 0.79943333357
       2: 2011     2     B 0.6496694945 0.2168088856 0.61779720359
       3: 2011     3     C 0.8413182027 0.9084385505 0.90381150902
       4: 2011     4     A 0.7272625659 0.4355531749 0.91872303933
       5: 2011     5     B 0.7147752908 0.9534050962 0.75510455621
      ---                                                         
 9999996: 2020     6     C 0.7728334034 0.9656879159 0.03099721554
 9999997: 2020     7     A 0.8534086784 0.2145124320 0.74231260596
 9999998: 2020     8     B 0.4714033590 0.0653402030 0.63881201576
 9999999: 2020     9     C 0.5170788274 0.4878072820 0.53781165020
10000000: 2020    10     A 0.8130705466 0.6612007422 0.16215236858
microbenchmark(data1[,.(Year, Group,
+          A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
+          B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
+          C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),by=Class])
Unit: milliseconds
        min          lq        mean     median          uq        max neval
 538.850986 638.5327615 895.7115241 808.087257 999.4477005 2146.21263   100
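The expression above needs one hand-written line per class. With many more classes, one possible way to avoid that is matrix indexing. The following is a sketch under the assumption that every value of Class matches the name of a value column. It has not been benchmarked here, and the names `value_cols`, `m` and `idx` are illustrative.

```r
library(data.table)

# Same sample data as in the question
data1 <- data.table(
  Year  = rep(c(2010, 2011), each = 3, times = 2),
  Group = rep(c(1, 2), each = 6),
  Class = rep(c("A", "B", "C"), times = 4),
  A = c(0.73, 0.55, 0.54, 0.49, 0.52, 0.49, 0.26, 0.55, 0.39, 0.34, 0.84, 0.29),
  B = c(0.12, 0.08, 0.14, 0.21, 0.33, 0.98, 0.33, 0.99, 0.02, 0.59, 0.27, 0.72),
  C = c(0.43, 0.51, 0.29, 0.60, 0.28, 0.97, 0.78, 0.84, 0.34, 0.82, 0.75, 0.97)
)

value_cols <- setdiff(names(data1), c("Year", "Group", "Class"))

# Apply (1 - x) * 0.05 to all value columns at once, as a numeric matrix
m <- (1 - as.matrix(data1[, value_cols, with = FALSE])) * 0.05

# Two-column index: for each row, the (row, column) position of the cell
# whose column name equals that row's Class
idx <- cbind(seq_len(nrow(m)), match(data1$Class, value_cols))

# 1 - (sum of the other columns) = 1 - (row total - own transformed value)
m[idx] <- 1 - (rowSums(m) - m[idx])

# Write the result into a copy, keeping the original row order
data2 <- copy(data1)
data2[, (value_cols) := as.data.table(m)]

head(data2)
```

Unlike the grouped version, this keeps the rows in their original order. A Class value with no matching column would give an `NA` in `idx` and make the assignment fail, so such rows would need to be handled separately.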

Actually it's not efficient, but it works; maybe it's helpful.

data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
                Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
                Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
                A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
                B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
                C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))


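let <- LETTERS  # class letter -> position of its value column (A = 1, B = 2, ...)

data2 <- data1  # keep a copy of the original values

# Apply (1 - x) * 0.05 to every value column (columns 4 onward)
data1[4:ncol(data1)] <- (1 - data1[4:ncol(data1)]) * 0.05

# Row by row, overwrite the column matching Class with 1 - sum of the other,
# already transformed, columns. The sum has to read from data1, because
# data2 still holds the untransformed values.
for (i in seq_len(nrow(data1))) {
  k <- which(data1[i, 3] == let)
  data1[i, k + 3] <- 1 - sum(data1[i, 4:ncol(data1)][-k])
}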
