
How to calculate a new column based on other columns using a lookup approach in R?

I am trying to calculate a new column in a data frame based on other columns and a lookup table. I have a simple example that shows only a few rows (my real dataset contains millions of rows).

I have the following datasets:

  lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"), 
           "condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))

  lookup
  class type condition1 condition2
      1    A         50         80
      2    B         60         85
      1    B         55         86
      2    A         53         83

My data frame has this shape:

  data<- data.frame("class"=c(1, 2, 2, 1, 2, 1), 
         "type"=c("A","B", "A", "A", "B", "B"), 
         "percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5), 
         "percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))


  data
  class type percentage_condition1 percentage_condition2
    1    A                   0.3                   0.7
    2    B                   0.6                   0.4
    2    A                   0.1                   0.9
    1    A                   0.2                   0.8
    2    B                   0.4                   0.6
    1    B                   0.5                   0.5

I would like to create a new column in my data frame named data, using the lookup table, such that:

for each row of data, the row of lookup with the same class and type is found and used to calculate the new column, for example (not real code):

data$new <- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2

I know how to do it with if/else statements, but I am trying to do it more efficiently because I am working with a lot of data. I also know how to do it when the lookup table has a single key column, but I have not managed to do it with several key columns (class and type).

Thanks for any help and suggestions!

We can use match to find, for each row of 'data', the index of the row of 'lookup' with the same 'class' and 'type' (pasted together into a single key), use that index to get the corresponding rows of the 'condition1' and 'condition2' columns, multiply them by the percentage columns of 'data', and take the rowSums:

data$new <- rowSums(lookup[match(paste(data$class, data$type), 
                  paste(lookup$class, lookup$type)), 
               c("condition1", "condition2")] * data[3:4])

data
#  class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5

NOTE: With match, this can be done in base R without loading any packages.
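To make the match step easier to follow, here is a minimal sketch that splits the one-liner above into its intermediate pieces. The names data_key, lookup_key and idx are illustrative only and are not part of the original answer.

```r
# Rebuild the example data from the question.
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Composite key: one string per row combining class and type, e.g. "1 A".
# (data_key and lookup_key are hypothetical helper names.)
data_key   <- paste(data$class, data$type)
lookup_key <- paste(lookup$class, lookup$type)

# idx[i] is the row of lookup whose key equals the key of row i of data.
# It is NA for any data row that has no matching lookup row.
idx <- match(data_key, lookup_key)
idx
# 1 2 4 1 2 3

# Index the lookup columns with idx and combine them with the percentages.
data$new <- lookup$condition1[idx] * data$percentage_condition1 +
  lookup$condition2[idx] * data$percentage_condition2
```

Because match returns NA for keys that are missing from lookup, such rows would get NA in the new column rather than raising an error, which may or may not be what you want on the full dataset.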


Or using data.table

library(data.table)
setDT(data)[lookup, new := condition1 * percentage_condition1 + 
       condition2 * percentage_condition2, on = .(class, type)]
data
#   class type percentage_condition1 percentage_condition2  new
#1:     1    A                   0.3                   0.7 71.0
#2:     2    B                   0.6                   0.4 70.0
#3:     2    A                   0.1                   0.9 80.0
#4:     1    A                   0.2                   0.8 74.0
#5:     2    B                   0.4                   0.6 75.0
#6:     1    B                   0.5                   0.5 70.5
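A possible variant of the data.table join, sketched here under the assumption that you would rather not convert data in place: it works on copies made with as.data.table and uses the i. prefix to make explicit which columns come from the lookup table. The names dt and lk are illustrative.

```r
library(data.table)

# Rebuild the example data from the question.
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# as.data.table() returns a copy, so the original data.frames stay untouched.
# (dt and lk are hypothetical names.)
dt <- as.data.table(data)
lk <- as.data.table(lookup)

# Update join: for every row of lk, the rows of dt with the same class and
# type get a value for 'new'. Columns prefixed with i. are taken from lk.
dt[lk, new := i.condition1 * percentage_condition1 +
             i.condition2 * percentage_condition2,
   on = .(class, type)]
dt
```

The i. prefix is optional here because dt has no condition1 or condition2 columns of its own, but it guards against picking up the wrong column if the two tables ever share a non-key column name.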

Or using tidyverse

library(tidyverse)
data %>% 
     left_join(lookup, by = c("class", "type")) %>%
     mutate(new = condition1 * percentage_condition1 + 
       condition2 * percentage_condition2) %>%
     select(names(data), new)
#   class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5
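If loading the whole tidyverse is more than you need, the same pipeline should run with dplyr alone. The sketch below also wraps the external character vector in all_of(), which makes the column selection explicit; result is an illustrative name.

```r
library(dplyr)

# Rebuild the example data from the question.
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# left_join keeps every row of data, in its original order, and adds the
# matching condition1/condition2 values from lookup.
result <- data %>%
  left_join(lookup, by = c("class", "type")) %>%
  mutate(new = condition1 * percentage_condition1 +
               condition2 * percentage_condition2) %>%
  # Keep only the original columns plus the new one.
  select(all_of(names(data)), new)
result
```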

Or use a SQL-based solution with sqldf

library(sqldf)
str1 <- "SELECT data.class, data.type, data.percentage_condition1, 
  data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 + 
   data.percentage_condition2 * lookup.condition2) as new
   FROM data 
   LEFT JOIN lookup on data.class = lookup.class AND 
   data.type = lookup.type"
sqldf(str1)

Or, as @G.Grothendieck mentioned in the comments, the sqldf solution can be made more compact with alias identifiers:

sqldf("select D.*, L.condition1 * D.[percentage_condition1] + 
       L.condition2 * D.[percentage_condition2] as new 
       from data as D 
       left join lookup as L 
       using(class, type)")

NOTE: All of the solutions above maintain the original row order of the dataset.

One option is to merge data and lookup and then perform the calculation. Note that merge sorts the result by the key columns by default, so the rows come out in a different order from data:

df1 <- merge(data, lookup) #This merges by class and type columns

df1$new <- with(df1, (condition1 * percentage_condition1) + 
                     (condition2 * percentage_condition2))


df1
#  class type percentage_condition1 percentage_condition2 condition1 condition2  new
#1     1    A                   0.3                   0.7         50         80 71.0
#2     1    A                   0.2                   0.8         50         80 74.0
#3     1    B                   0.5                   0.5         55         86 70.5
#4     2    A                   0.1                   0.9         53         83 80.0
#5     2    B                   0.6                   0.4         60         85 70.0
#6     2    B                   0.4                   0.6         60         85 75.0
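If the original row order matters, one way to keep it with merge is to tag each row with its position before merging and sort on that tag afterwards. This is a sketch, not part of the original answer; row_id is a hypothetical helper column.

```r
# Rebuild the example data from the question.
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Remember the original position of every row (row_id is a hypothetical name).
data$row_id <- seq_len(nrow(data))

# all.x = TRUE keeps rows of data that have no match in lookup (as NA).
df1 <- merge(data, lookup, by = c("class", "type"), all.x = TRUE)

# Restore the original order, then drop the helper column.
df1 <- df1[order(df1$row_id), ]
df1$row_id <- NULL
rownames(df1) <- NULL

df1$new <- with(df1, condition1 * percentage_condition1 +
                     condition2 * percentage_condition2)
df1
```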
