Add New Columns to One Dataframe Based on Values and Functions in Another With Conditions in R
A tricky one for you. I have two data frames. The first is a list of odds ratios (skipping the first variable, as it is our predictor). See below:
Variable Name | Odds |
---|---|
Var2 | 0.87 |
Var3 | 1.42 |
Var4 | 2.10 |
Var5 | 0.56 |
Var6 | 1.01 |
The second is a list of subjects, with the variables as columns and a binary flag (0/1) indicating whether each variable is present. See below:
Subject Name | Var 1 (Predictor) | Var 2 | Var 3 | Var 4 | Var 5 | Var 6 |
---|---|---|---|---|---|---|
Dog | 1 | 0 | 1 | 0 | 0 | 1 |
Cat | 1 | 0 | 0 | 0 | 1 | 1 |
Elephant | 0 | 0 | 0 | 0 | 0 | 0 |
Bear | 1 | 0 | 0 | 1 | 1 | 1 |
Jackal | 0 | 0 | 0 | 0 | 0 | 1 |
What I now need to do is add x new variables, each holding the odds ratio of a variable where it is present. For example, `Dog` has 2 present predictors, `Var 3` and `Var 6`, so we want to multiply `Var3=1.42 * Var6=1.01` = 1.43, leaving the others blank. For `Jackal` we have one predictor, `Var 6=1.01`, leaving the others blank. Adding these in as individual variables, along with their total product, is preferred.
The number of variables may change from five to six or seven, so hard-coding names in the function will not work. It needs to be based on the number of initial variables (minus 1 for the predictor).
Thus far I have tried writing a complex `ifelse` statement, but it doesn't work for the dynamic range. Perhaps something that matches on names, which get added in with a prefix / suffix? Or by position? I am genuinely stumped on how to do this in the most efficient way. Hope it's clear, happy to provide more detail if required / requested.
Let us assume that your first data frame is called `odds_df` and your second is called `presence_df`. I have taken the data from your question and made reproducible versions of these at the bottom of this answer.
As you may have different numbers of columns representing variables in your presence / absence data frame, you will need some way of identifying which columns you are trying to match to your odds data. In your example data, this could be done by variable name, using string matching:
var_cols <- grep("^Var\\d+$", names(presence_df))
or simply according to which columns you know to be those of interest:
var_cols <- 3:7
Or, if the first two columns are always the name and dependent variable, then a more general solution would be:
var_cols <- 3:ncol(presence_df)
In this case, any of these three ways of generating `var_cols` will give you the vector `c(3, 4, 5, 6, 7)`.
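Since the number of variables may change, another option is to let the odds table itself decide which columns count as variables. The following is a minimal sketch of that idea, not part of the original answer; it assumes the column names in `presence_df` match the entries of `odds_df$`Variable Name`` exactly, and the name `var_cols_by_name` is only illustrative.

```r
# Data mirroring the question (same names as the reproducible data in the answer)
odds_df <- data.frame(
  `Variable Name` = c("Var2", "Var3", "Var4", "Var5", "Var6"),
  Odds = c(0.87, 1.42, 2.10, 0.56, 1.01),
  check.names = FALSE
)
presence_df <- data.frame(
  `Subject Name` = c("Dog", "Cat", "Elephant", "Bear", "Jackal"),
  `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L),
  Var2 = c(0L, 0L, 0L, 0L, 0L),
  Var3 = c(1L, 0L, 0L, 0L, 0L),
  Var4 = c(0L, 0L, 0L, 1L, 0L),
  Var5 = c(0L, 1L, 0L, 1L, 0L),
  Var6 = c(1L, 1L, 0L, 1L, 1L),
  check.names = FALSE
)

# Positions of the columns whose names appear in the odds table
# (hypothetical helper name; assumes exact name matches)
var_cols_by_name <- which(names(presence_df) %in% odds_df$`Variable Name`)
var_cols_by_name
#> [1] 3 4 5 6 7
```

With this approach, adding a `Var7` column and a matching `Var7` row to the odds table would extend the selection without any change to the code.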
You now need to know which odds apply to which columns, so we `match` the columns of our presence / absence data frame with the correct row in the odds data frame like this:
var_rows <- match(names(presence_df)[var_cols], odds_df$`Variable Name`)
Now we are in a position to get the vector of odds we need to apply to our columns:
odds_vec <- odds_df$Odds[var_rows]
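One thing worth guarding against at this step: `match` returns `NA` for any column name it cannot find in the odds table, and those `NA` values would then flow silently into every later calculation. The sketch below is an illustration under assumed data, not part of the original answer; the odds table is deliberately shuffled to show that matching is by name rather than by position, and `presence_names` and `unmatched` are hypothetical names.

```r
# Odds table in shuffled order, to show that match() aligns by name
odds_df <- data.frame(
  `Variable Name` = c("Var6", "Var2", "Var5", "Var3", "Var4"),
  Odds = c(1.01, 0.87, 0.56, 1.42, 2.10),
  check.names = FALSE
)

# Only the column names of the presence / absence data matter for this step
presence_names <- c("Subject Name", "Var 1 (Predictor)",
                    "Var2", "Var3", "Var4", "Var5", "Var6")
var_cols <- 3:length(presence_names)
var_names <- presence_names[var_cols]

var_rows <- match(var_names, odds_df$`Variable Name`)

# Stop early if any variable column has no corresponding odds row
unmatched <- var_names[is.na(var_rows)]
if (length(unmatched) > 0) {
  stop("No odds found for: ", paste(unmatched, collapse = ", "))
}

# Odds reordered to follow the column order Var2 ... Var6
odds_vec <- odds_df$Odds[var_rows]
odds_vec
#> [1] 0.87 1.42 2.10 0.56 1.01
```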
To get the correct odds for each entry given its presence or absence, we need to remember that since odds are multiplicative, any absent variable should be given a value of 1, not 0, so that it does not affect the odds calculation. This means we should multiply each row of our presence / absence data by the log odds and exponentiate the result to get the actual odds: a present variable (1) gives exp(log(odds)) = odds, and an absent variable (0) gives exp(0) = 1.
We can do this row-wise using `apply`:
res <- t(apply(presence_df[var_cols], 1, function(x) exp(x * log(odds_vec))))
res
#> Var2 Var3 Var4 Var5 Var6
#> [1,] 1 1.42 1.0 1.00 1.01
#> [2,] 1 1.00 1.0 0.56 1.01
#> [3,] 1 1.00 1.0 1.00 1.00
#> [4,] 1 1.00 2.1 0.56 1.01
#> [5,] 1 1.00 1.0 1.00 1.01
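Because `exp(x * log(odds))` is the same as `odds^x`, the row-wise `apply` can also be written as a single vectorised call. The sketch below uses `sweep` as one possible alternative and checks that it agrees with the `apply` version; it is an illustration rather than the answer's method, and it uses a small named matrix `m` in place of `presence_df[var_cols]`.

```r
# Odds per variable and the 0/1 presence matrix from the question
odds_vec <- c(Var2 = 0.87, Var3 = 1.42, Var4 = 2.10, Var5 = 0.56, Var6 = 1.01)
m <- rbind(
  Dog      = c(0, 1, 0, 0, 1),
  Cat      = c(0, 0, 0, 1, 1),
  Elephant = c(0, 0, 0, 0, 0),
  Bear     = c(0, 0, 1, 1, 1),
  Jackal   = c(0, 0, 0, 0, 1)
)
colnames(m) <- names(odds_vec)

# Row-wise version, as in the answer
res_apply <- t(apply(m, 1, function(x) exp(x * log(odds_vec))))

# Vectorised version: raise each column's odds to the power of its 0/1 flag,
# so present gives odds^1 = odds and absent gives odds^0 = 1
res_sweep <- sweep(m, 2, odds_vec, function(present, odds) odds^present)

# Both approaches agree up to floating point tolerance
all.equal(res_apply, res_sweep, check.attributes = FALSE)
#> [1] TRUE
```

The `sweep` form avoids the transpose and the log / exp round trip, which may matter on larger data; on a table of this size either is fine.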
Now we can write this back into our presence / absence data frame like so:
presence_df[var_cols] <- res
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6
#> 1 Dog 1 1 1.42 1.0 1.00 1.01
#> 2 Cat 1 1 1.00 1.0 0.56 1.01
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00
#> 4 Bear 1 1 1.00 2.1 0.56 1.01
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01
The final step is to calculate the odds implied by the presence or absence variables, which is simply the row-wise product of these odds:
presence_df$odds <- apply(presence_df[var_cols], 1, prod)
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6 odds
#> 1 Dog 1 1 1.42 1.0 1.00 1.01 1.43420
#> 2 Cat 1 1 1.00 1.0 0.56 1.01 0.56560
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00 1.00000
#> 4 Bear 1 1 1.00 2.1 0.56 1.01 1.18776
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01 1.01000
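The question asks for the odds to be added as new variables, with absent ones left blank, rather than overwriting the 0/1 flags. A possible variant along those lines is sketched below. It is an assumption about what is wanted, not the answer's own code: the `_odds` suffix and the `total_odds` column name are made up for illustration, and blanks are represented as `NA`.

```r
odds_df <- data.frame(
  `Variable Name` = c("Var2", "Var3", "Var4", "Var5", "Var6"),
  Odds = c(0.87, 1.42, 2.10, 0.56, 1.01),
  check.names = FALSE
)
presence_df <- data.frame(
  `Subject Name` = c("Dog", "Cat", "Elephant", "Bear", "Jackal"),
  `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L),
  Var2 = c(0L, 0L, 0L, 0L, 0L),
  Var3 = c(1L, 0L, 0L, 0L, 0L),
  Var4 = c(0L, 0L, 0L, 1L, 0L),
  Var5 = c(0L, 1L, 0L, 1L, 0L),
  Var6 = c(1L, 1L, 0L, 1L, 1L),
  check.names = FALSE
)

# Variable columns are those named in the odds table
var_cols <- which(names(presence_df) %in% odds_df$`Variable Name`)
var_names <- names(presence_df)[var_cols]
odds_vec <- odds_df$Odds[match(var_names, odds_df$`Variable Name`)]

# Matrix of odds where present, NA ("blank") where absent;
# rep(..., each = nrow) lines each odds value up with its column
m <- as.matrix(presence_df[var_cols])
odds_mat <- ifelse(m == 1, rep(odds_vec, each = nrow(m)), NA_real_)
colnames(odds_mat) <- paste0(var_names, "_odds")  # hypothetical suffix

# Keep the original flags and append the new columns
out <- cbind(presence_df, odds_mat)

# Product of the present odds; a row with nothing present gives 1
out$total_odds <- apply(odds_mat, 1, prod, na.rm = TRUE)
out[c("Subject Name", "total_odds")]
```

Here `prod(..., na.rm = TRUE)` ignores the blanks, so an all-blank row such as `Elephant` comes out as 1, which matches the result obtained by substituting 1 for absent variables.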
It might also be helpful to convert the odds into probabilities, so that we can see the probability of the dependent variable being present given the predictor variables:
presence_df$prob <- presence_df$odds / (1 + presence_df$odds)
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6 odds prob
#> 1 Dog 1 1 1.42 1.0 1.00 1.01 1.43420 0.5891874
#> 2 Cat 1 1 1.00 1.0 0.56 1.01 0.56560 0.3612672
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00 1.00000 0.5000000
#> 4 Bear 1 1 1.00 2.1 0.56 1.01 1.18776 0.5429115
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01 1.01000 0.5024876
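As a quick check on the conversion, `odds / (1 + odds)` is the inverse logit of the log odds, which base R exposes as `plogis`. The short sketch below compares the two on the odds computed above. Note that this treats the product of odds ratios as odds in their own right, as the step above does, which amounts to assuming baseline odds of 1.

```r
# Row-wise odds from the table above
odds <- c(Dog = 1.4342, Cat = 0.5656, Elephant = 1, Bear = 1.18776, Jackal = 1.01)

# Manual conversion, as in the answer
prob_manual <- odds / (1 + odds)

# Same conversion via the logistic distribution function on the log odds
prob_plogis <- plogis(log(odds))

all.equal(prob_manual, prob_plogis)
#> [1] TRUE
```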
**Data from question in reproducible format**
odds_df <- structure(list(`Variable Name` = c("Var2", "Var3", "Var4", "Var5",
"Var6"), Odds = c(0.87, 1.42, 2.1, 0.56, 1.01)), row.names = c(NA,
-5L), class = "data.frame")
presence_df <- structure(list(`Subject Name` = c("Dog", "Cat", "Elephant",
"Bear", "Jackal"), `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L), Var2 = c(0L,
0L, 0L, 0L, 0L), Var3 = c(1L, 0L, 0L, 0L, 0L), Var4 = c(0L, 0L,
0L, 1L, 0L), Var5 = c(0L, 1L, 0L, 1L, 0L), Var6 = c(1L, 1L, 0L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))