Data.Table Manipulate Many Columns Conditionally
data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))
##>data1
## Year Group Class A B C
## 1 2010 1 A 0.73 0.12 0.43
## 2 2010 1 B 0.55 0.08 0.51
## 3 2010 1 C 0.54 0.14 0.29
## 4 2011 1 A 0.49 0.21 0.60
## 5 2011 1 B 0.52 0.33 0.28
## 6 2011 1 C 0.49 0.98 0.97
## 7 2010 2 A 0.26 0.33 0.78
## 8 2010 2 B 0.55 0.99 0.84
## 9 2010 2 C 0.39 0.02 0.34
## 10 2011 2 A 0.34 0.59 0.82
## 11 2011 2 B 0.84 0.27 0.75
## 12 2011 2 C 0.29 0.72 0.97
I have 'data1' and wish to make 'data2'. 'data2' will have exactly the same dimensions as 'data1', but with the following conditions applied:
IF Class = 'A', then Column 'B' = (1-B)*0.05, Column 'C' = (1-C)*0.05, and after updating Column 'B' and Column 'C', we calculate Column 'A' = 1 - (B+C).
IF Class = 'B', then Column 'A' = (1-A)*0.05, Column 'C' = (1-C)*0.05, and after updating Column 'A' and Column 'C', we calculate Column 'B' = 1 - (A+C).
IF Class = 'C', then Column 'A' = (1-A)*0.05, Column 'B' = (1-B)*0.05, and after updating Column 'A' and Column 'B', we calculate Column 'C' = 1 - (A+B).
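To make the rules concrete, here is the arithmetic for the first row of 'data1' (Class 'A', A = 0.73, B = 0.12, C = 0.43), written out by hand. The names B_new, C_new and A_new are only illustrative and are not columns in the data.

```r
# Row 1 of data1: Class 'A', so B and C are rescaled first
B_new <- (1 - 0.12) * 0.05      # 0.044
C_new <- (1 - 0.43) * 0.05      # 0.0285

# A is then computed from the updated B and C
A_new <- 1 - (B_new + C_new)    # 0.9275
```

With these rules the three updated values in a row always add up to 1.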
I am hoping for an efficient data.table solution, since I have a very large dataset with many more 'Classes' than 3.
Here is a slow solution that makes the desired updates.
library(data.table)
setDT(data1)
data1[, newB := fifelse(Class == 'A', (1-B) * 0.05, NA_real_)]
data1[, newC := fifelse(Class == 'A', (1-C) * 0.05, NA_real_)]
data1[, newA := fifelse(Class == 'A', (1-(newB+newC)), NA_real_)]
data1[, newA := fifelse(Class == 'B', (1-A) * 0.05, newA)]
data1[, newC := fifelse(Class == 'B', (1-C) * 0.05, newC)]
data1[, newB := fifelse(Class == 'B', (1-(newA+newC)), newB)]
data1[, newA := fifelse(Class == 'C', (1-A) * 0.05, newA)]
data1[, newB := fifelse(Class == 'C', (1-B) * 0.05, newB)]
data1[, newC := fifelse(Class == 'C', (1-(newA+newB)), newC)]
I suggest the following:
# Setting the dataframe as a data.table
data1 <- data.table::setDT(data1)
head(data1)
Year Group Class A B C
1: 2010 1 A 0.73 0.12 0.43
2: 2010 1 B 0.55 0.08 0.51
3: 2010 1 C 0.54 0.14 0.29
4: 2011 1 A 0.49 0.21 0.60
5: 2011 1 B 0.52 0.33 0.28
6: 2011 1 C 0.49 0.98 0.97
# First I copy this data.table
data2 = data.table::copy(data1)
# I store the variable names that I will change
list_of_var = setdiff(colnames(data1), c("Year", "Group", "Class"))
# In data1 I change by reference all these variable with the
# transformation (1-X)*0.05
data1[, (list_of_var) := lapply(.SD, function(x) (1-x)*0.05),.SDcols = list_of_var]
# Then for each of my variables
for (variable in list_of_var){
  # I store the names of the other variables
  cols <- setdiff(colnames(data1), c(variable, "Year", "Group", "Class"))
  # and apply the transformation conditionally on value of Class
  for (var in cols){
    data2[Class == variable, (var) := data1[Class == variable, var, with = FALSE]]
  }
}
# After doing this I will now apply the 1-B-C transformation for A conditionally
# on Class, and same for each variable
for (variable in list_of_var){
  other_vars = setdiff(list_of_var, variable)
  new_var = apply(data2[Class == variable, ..other_vars], MARGIN = 1, sum)
  data2[Class == variable, (variable) := 1 - new_var]
}
head(data2)
This is now the result:
Year Group Class A B C
1: 2010 1 A 0.9275 0.0440 0.0285
2: 2010 1 B 0.0225 0.9530 0.0245
3: 2010 1 C 0.0230 0.0430 0.9340
4: 2011 1 A 0.9405 0.0395 0.0200
5: 2011 1 B 0.0240 0.9400 0.0360
6: 2011 1 C 0.0255 0.0010 0.9735
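Since the question mentions many more classes than three, the same result can probably also be reached without looping over classes. The following is only a sketch under one assumption: every value of Class matches the name of one of the value columns. The names dt, value_cols, m, own and result are made up for this example. It rescales all value columns at once, then uses matrix indexing to overwrite each row's own Class column.

```r
library(data.table)

# Small stand-in for data1 (first three rows of the example data)
dt <- data.table(
  Class = c("A", "B", "C"),
  A = c(0.73, 0.55, 0.54),
  B = c(0.12, 0.08, 0.14),
  C = c(0.43, 0.51, 0.29)
)
value_cols <- c("A", "B", "C")

# Rescale every value column at once: (1 - x) * 0.05
m <- (1 - as.matrix(dt[, ..value_cols])) * 0.05

# (row, column) position of each row's own Class column
own <- cbind(seq_len(nrow(m)), match(dt$Class, value_cols))

# Own column = 1 - sum of the other rescaled columns
m[own] <- 1 - (rowSums(m) - m[own])

# Write the matrix back into a copy, leaving dt untouched
result <- copy(dt)
result[, (value_cols) := as.data.table(m)]
```

On these three rows it reproduces the first three rows of the result shown above. I have not benchmarked it on large data.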
Edit: OK, this has been bothering me all day. How about this:
data1[,.(Year, Group,
A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),
by=Class]
Class Year Group A B C
1: A 2010 1 0.9275 0.0440 0.0285
2: A 2011 1 0.9405 0.0395 0.0200
3: A 2010 2 0.9555 0.0335 0.0110
4: A 2011 2 0.9705 0.0205 0.0090
5: B 2010 1 0.0225 0.9530 0.0245
6: B 2011 1 0.0240 0.9400 0.0360
7: B 2010 2 0.0225 0.9695 0.0080
8: B 2011 2 0.0080 0.9795 0.0125
9: C 2010 1 0.0230 0.0430 0.9340
10: C 2011 1 0.0255 0.0010 0.9735
11: C 2010 2 0.0305 0.0490 0.9205
12: C 2011 2 0.0355 0.0140 0.9505
The by=Class version above handles 10,000,000 rows in under a second in the typical case (median about 0.8 s in the benchmark below).
data1 <- data.table(Year = rep(2011:2020,each=1000000),Group = rep(1:10,times=1000000),Class = LETTERS[1:3], A = runif(1000000,0,1),B = runif(1000000,0,1),C = runif(1000000,0,1))
data1
Year Group Class A B C
1: 2011 1 A 0.2890449290 0.6917136966 0.79943333357
2: 2011 2 B 0.6496694945 0.2168088856 0.61779720359
3: 2011 3 C 0.8413182027 0.9084385505 0.90381150902
4: 2011 4 A 0.7272625659 0.4355531749 0.91872303933
5: 2011 5 B 0.7147752908 0.9534050962 0.75510455621
---
9999996: 2020 6 C 0.7728334034 0.9656879159 0.03099721554
9999997: 2020 7 A 0.8534086784 0.2145124320 0.74231260596
9999998: 2020 8 B 0.4714033590 0.0653402030 0.63881201576
9999999: 2020 9 C 0.5170788274 0.4878072820 0.53781165020
10000000: 2020 10 A 0.8130705466 0.6612007422 0.16215236858
library(microbenchmark)
microbenchmark(data1[,.(Year, Group,
A = if("A" == Class){1 - (((1-B) * 0.05) + ((1-C) * 0.05))}else{(1-A) * 0.05},
B = if("B" == Class){1 - (((1-A) * 0.05) + ((1-C) * 0.05))}else{(1-B) * 0.05},
C = if("C" == Class){1 - (((1-A) * 0.05) + ((1-B) * 0.05))}else{(1-C) * 0.05}),by=Class])
Unit: milliseconds
min lq mean median uq max neval
538.850986 638.5327615 895.7115241 808.087257 999.4477005 2146.21263 100
Actually it's not efficient, but it works; maybe it's helpful.
data1=data.frame(Year=c(2010,2010,2010,2011,2011,2011,2010,2010,2010,2011,2011,2011),
Group=c(1,1,1,1,1,1,2,2,2,2,2,2),
Class=c('A','B','C','A','B','C','A','B','C','A','B','C'),
A=c(0.73,0.55,0.54,0.49,0.52,0.49,0.26,0.55,0.39,0.34,0.84,0.29),
B=c(0.12,0.08,0.14,0.21,0.33,0.98,0.33,0.99,0.02,0.59,0.27,0.72),
C=c(0.43,0.51,0.29,0.6,0.28,0.97,0.78,0.84,0.34,0.82,0.75,0.97))
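let<-toupper(letters)
# data2 keeps the original values; data1 is overwritten with (1-x)*0.05
data2<-data1
data1[4:ncol(data1)]<-(1-data1[4:ncol(data1)])*0.05
for(i in 1:nrow(data1))
{
  # position of this row's Class among the value columns
  k<-which(data2[i,3]==let)
  # own column = 1 - sum of the other, already transformed, columns (result ends up in data1)
  data1[i,k+3]<-1-sum(data1[i,4:ncol(data1)][-k])
}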