Add new columns to one data frame based on values from another and a conditional function in R

Question

A tricky one for you. I have two data frames. The first is a list of odds ratios (skipping the first variable, as it is our predictor). See below:

Variable Name	Odds
Var2	0.87
Var3	1.42
Var4	2.10
Var5	0.56
Var6	1.01

The second is a list of subjects, with the variables as columns and a binary flag (0/1) indicating whether each is present. See below:

Subject Name	Var 1 (Predictor)	Var 3	Var 4	Var 5	Var 6
Dog	1	1	0	0	1
Cat	1	0	0	1	1
Elephant	0	0	0	0	0
Bear	1	0	1	1	1
Jackal	0	0	0	0	1

What I now need to do is add x new variables holding each odds ratio where the flag is present. For example, Dog has two present predictors, Var 3 and Var 6, so we want to multiply Var3 = 1.42 by Var6 = 1.01 to get 1.43, leaving the others blank. For Jackal we have one predictor, Var 6 = 1.01, leaving the others blank. Adding these in as individual variables, along with their total product, is preferred.

The number of variables may change from five to six or seven, so specifying names for the function will not work. It needs to be based on the number of initial variables (minus one for the predictor).

Thus far I have tried writing a complex ifelse statement, but it doesn't work for the dynamic range. Perhaps something that matches on names, which then get added with a prefix or suffix? Or by position? I am genuinely stumped on the most efficient way to do this. Hope it's clear; happy to provide more detail if required.

Answer 1

Let us assume that your first data frame is called odds_df and your second is called presence_df. I have taken the data from your question and made reproducible versions of these at the bottom of this answer.

As you may have different numbers of columns representing variables in your presence / absence data frame, you will need some way of identifying which columns you are trying to match to your odds data. In your example data, this could be according to variable names, using string matching:

var_cols <- grep("^Var\\d+$", names(presence_df))

or simply according to which columns you know to be those of interest:

var_cols <- 3:7

Or, if the first two columns are always the name and dependent variable, then a more general solution would be:

var_cols <- 3:ncol(presence_df)

In this case, any of these three ways of generating var_cols will give you the vector c(3, 4, 5, 6, 7).
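
As a quick sanity check, the three approaches can be compared on the column names alone. This is only a sketch: presence_names is an illustrative vector copied from the reproducible data at the bottom of this answer, not an object used elsewhere in the answer.

```r
# Column names copied from the example data at the bottom of this answer
presence_names <- c("Subject Name", "Var 1 (Predictor)",
                    "Var2", "Var3", "Var4", "Var5", "Var6")

by_pattern  <- grep("^Var\\d+$", presence_names)  # match on names
by_position <- 3:7                                # hard-coded positions
by_offset   <- 3:length(presence_names)           # everything from column 3 on

identical(by_pattern, by_position)
identical(by_position, by_offset)

# The pattern does not allow a space, so a header such as "Var 3" is skipped
spaced_match <- grepl("^Var\\d+$", "Var 3")
```

If your real headers contain a space, as in the table in the question, the pattern would need adjusting or the positional approach may be the safer choice.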

You now need to know which odds apply to which columns, so we match the columns of our presence / absence data frame with the correct row in the odds data frame like this:

var_rows <- match(names(presence_df)[var_cols], odds_df$`Variable Name`)

Now we are in a position to get the vector of odds we need to apply to our columns:

odds_vec <- odds_df$Odds[var_rows]
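
One thing to watch for: match returns NA for any column name that has no row in the odds table, and that NA carries through into odds_vec. A minimal sketch of a check, assuming a made-up column Var7 that is not in the question's data:

```r
# Cut-down odds table; Var7 below is a hypothetical column with no odds row
odds_df <- data.frame(`Variable Name` = c("Var2", "Var3"),
                      Odds = c(0.87, 1.42),
                      check.names = FALSE)
col_names <- c("Var2", "Var3", "Var7")

var_rows  <- match(col_names, odds_df$`Variable Name`)
unmatched <- col_names[is.na(var_rows)]

# Report any columns that could not be matched before using the odds
if (length(unmatched) > 0) {
  message("No odds found for: ", paste(unmatched, collapse = ", "))
}

odds_vec <- odds_df$Odds[var_rows]  # the third element is NA
```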

To get the correct odds that should apply for each entry given its presence or absence, we need to remember that since odds are multiplicative, any absent variable should be given a value of 1, not 0, so that it does not affect the odds calculation. This means we should multiply each row of our presence / absence data by the log odds and exponentiate the result to get the actual odds.

We can do this row-wise using apply:

res <- t(apply(presence_df[var_cols], 1, function(x) exp(x * log(odds_vec))))

res
#>      Var2 Var3 Var4 Var5 Var6
#> [1,]    1 1.42  1.0 1.00 1.01
#> [2,]    1 1.00  1.0 0.56 1.01
#> [3,]    1 1.00  1.0 1.00 1.00
#> [4,]    1 1.00  2.1 0.56 1.01
#> [5,]    1 1.00  1.0 1.00 1.01
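
Because each flag is either 0 or 1, exp(x * log(odds)) is the same as odds ^ x: a flag of 0 gives 1 and a flag of 1 gives the odds itself. A small sketch for the Dog row, with the values typed in by hand from the example data:

```r
# Odds and Dog's flags, typed in from the example data
odds_vec <- c(Var2 = 0.87, Var3 = 1.42, Var4 = 2.10, Var5 = 0.56, Var6 = 1.01)
dog      <- c(Var2 = 0,    Var3 = 1,    Var4 = 0,    Var5 = 0,    Var6 = 1)

via_logs  <- exp(dog * log(odds_vec))  # the form used in this answer
via_power <- odds_vec ^ dog            # equivalent for 0/1 flags

all.equal(via_logs, via_power)
```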

Now we can write this back into our presence / absence data frame like so:

presence_df[var_cols] <- res

presence_df
#>   Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6
#> 1          Dog                 1    1 1.42  1.0 1.00 1.01
#> 2          Cat                 1    1 1.00  1.0 0.56 1.01
#> 3     Elephant                 0    1 1.00  1.0 1.00 1.00
#> 4         Bear                 1    1 1.00  2.1 0.56 1.01
#> 5       Jackal                 0    1 1.00  1.0 1.00 1.01

The final step is to calculate the odds implied by the presence or absence variables, which is simply the row-wise product of these odds:

presence_df$odds <- apply(presence_df[var_cols], 1, prod)

presence_df
#>   Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6    odds
#> 1          Dog                 1    1 1.42  1.0 1.00 1.01 1.43420
#> 2          Cat                 1    1 1.00  1.0 0.56 1.01 0.56560
#> 3     Elephant                 0    1 1.00  1.0 1.00 1.00 1.00000
#> 4         Bear                 1    1 1.00  2.1 0.56 1.01 1.18776
#> 5       Jackal                 0    1 1.00  1.0 1.00 1.01 1.01000
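
The question also mentions leaving absent variables blank and adding the odds as separate, suffixed columns rather than overwriting the flags. One possible variant is sketched below. It assumes NA is an acceptable "blank"; the _odds suffix and the total_odds name are my own choices, not something from the question.

```r
odds_df <- data.frame(`Variable Name` = c("Var2", "Var3", "Var4", "Var5", "Var6"),
                      Odds = c(0.87, 1.42, 2.10, 0.56, 1.01),
                      check.names = FALSE)
presence_df <- data.frame(`Subject Name` = c("Dog", "Cat", "Elephant", "Bear", "Jackal"),
                          `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L),
                          Var2 = c(0L, 0L, 0L, 0L, 0L),
                          Var3 = c(1L, 0L, 0L, 0L, 0L),
                          Var4 = c(0L, 0L, 0L, 1L, 0L),
                          Var5 = c(0L, 1L, 0L, 1L, 0L),
                          Var6 = c(1L, 1L, 0L, 1L, 1L),
                          check.names = FALSE)

var_cols  <- 3:ncol(presence_df)
var_names <- names(presence_df)[var_cols]
odds_vec  <- odds_df$Odds[match(var_names, odds_df$`Variable Name`)]

# One row of odds per subject, then blank out (NA) the absent variables
flags    <- as.matrix(presence_df[var_cols])
odds_mat <- matrix(odds_vec, nrow = nrow(flags), ncol = length(odds_vec),
                   byrow = TRUE)
odds_mat[flags == 0] <- NA
colnames(odds_mat) <- paste0(var_names, "_odds")

# Keep the original flags and append the new columns plus their product
result <- cbind(presence_df, odds_mat)
result$total_odds <- apply(odds_mat, 1, prod, na.rm = TRUE)
```

With na.rm = TRUE, a subject with no present variables (Elephant here) gets a total of 1, since the product of an empty set is 1.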

It might also be helpful to convert the odds into probabilities, so that we can see the probability of the dependent variable being present given the predictor variables:

presence_df$prob <- presence_df$odds / (1 + presence_df$odds)

presence_df
#>   Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6    odds      prob
#> 1          Dog                 1    1 1.42  1.0 1.00 1.01 1.43420 0.5891874
#> 2          Cat                 1    1 1.00  1.0 0.56 1.01 0.56560 0.3612672
#> 3     Elephant                 0    1 1.00  1.0 1.00 1.00 1.00000 0.5000000
#> 4         Bear                 1    1 1.00  2.1 0.56 1.01 1.18776 0.5429115
#> 5       Jackal                 0    1 1.00  1.0 1.00 1.01 1.01000 0.5024876
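
The odds-to-probability conversion is the inverse logit applied to the log odds, so base R's plogis should give the same numbers. A brief sketch, with the odds typed in by hand from the output above:

```r
# Total odds for Dog, Cat, Elephant, Bear, Jackal from the output above
odds <- c(1.4342, 0.5656, 1, 1.18776, 1.01)

prob_direct <- odds / (1 + odds)   # the formula used in this answer
prob_plogis <- plogis(log(odds))   # inverse logit of the log odds

all.equal(prob_direct, prob_plogis)
```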

Data from question in reproducible format

odds_df <- structure(list(`Variable Name` = c("Var2", "Var3", "Var4", "Var5", 
"Var6"), Odds = c(0.87, 1.42, 2.1, 0.56, 1.01)), row.names = c(NA, 
-5L), class = "data.frame")

presence_df <- structure(list(`Subject Name` = c("Dog", "Cat", "Elephant", 
"Bear", "Jackal"), `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L), Var2 = c(0L, 
0L, 0L, 0L, 0L), Var3 = c(1L, 0L, 0L, 0L, 0L), Var4 = c(0L, 0L, 
0L, 1L, 0L), Var5 = c(0L, 1L, 0L, 1L, 0L), Var6 = c(1L, 1L, 0L, 
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))
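
Since the number of variable columns may change, the steps above could be wrapped in a small function. The sketch below is only one way to do it: add_total_odds and its first_var_col argument are hypothetical names, it assumes the variable columns run from first_var_col to the last column, and the demo data is made up with three variables rather than five.

```r
# Hypothetical helper: appends an `odds` column for any number of variables
add_total_odds <- function(presence, odds, first_var_col = 3) {
  var_cols <- first_var_col:ncol(presence)
  idx <- match(names(presence)[var_cols], odds$`Variable Name`)
  if (anyNA(idx)) stop("Some variable columns have no matching odds row")
  flags <- as.matrix(presence[var_cols])
  # Sum the log odds of the present variables, then return to the odds scale
  presence$odds <- as.vector(exp(flags %*% log(odds$Odds[idx])))
  presence
}

# Made-up data with three variables instead of five
odds <- data.frame(`Variable Name` = c("Var2", "Var3", "Var4"),
                   Odds = c(0.5, 2, 4),
                   check.names = FALSE)
presence <- data.frame(`Subject Name` = c("A", "B"),
                       `Var 1 (Predictor)` = c(1L, 0L),
                       Var2 = c(1L, 0L),
                       Var3 = c(1L, 0L),
                       Var4 = c(0L, 1L),
                       check.names = FALSE)

out <- add_total_odds(presence, odds)
out$odds
```

The matrix product replaces the two apply calls; it computes the same row-wise product of odds for the present variables.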
