I'm trying to write a piece of optimization code in R to calculate a set of unknown values for a biological question concerning drosophila flies.
The dataframe is composed of 13 columns (just showing 9 in code below for clarity) with varying numbers of rows. The first three columns contain collected data and the rest of the columns are calculated using various formulas. Two of the columns, Missing_C and Missing_D are initially populated with empty data and in the optimization problem represent the initial values.
Time.min. Prob_C Prob_D Miss_C Miss_D Event_C Event_D Risk_C Risk_D
1 0 1.00 1.00 0 0 0.00 0.00 86.00 78.00
2 5 0.98 0.97 0 0 1.93 1.98 84.07 76.02
3 16 0.84 0.95 0 0 10.67 1.90 73.40 74.12
4 17 0.50 0.75 0 0 21.02 12.85 52.38 61.27
5 20 0.30 0.50 0 0 14.97 15.32 37.42 45.95
As an example of the some of the formulas used, Event_C and Risk_C are calculated with a for loop as follows:
#define values for events_c and risk_c with for loops`
temp_events_c <-vector()
temp_risk_c <-vector()
for (i in 2:no_rows) {
temp_events_c <- ((prob_c[i] * risk_c[i-1]) - (prob_c[i] * miss_c[i-1]) - (prob_c[i-1] * risk_c[i-1]) + (prob_c[i-1] * miss_c[i-1])) / (prob_c[i] - (2 * prob_c[i-1]))
events_c[i] <- temp_events_c
temp_risk_c <- risk_c[i-1] - miss_c[i-1] - events_c[i]
risk_c[i] <- temp_risk_c
}
From this data, I also have a single, collected, value (9.1 in this instance) which relates to the values in the table. The following function defines the relationship to this value to columns Event_C, Event_D and two columns not shown in the above, Expected_C and Expected_D where the sums of those columns are represented by x[1], x[2], x[3], x[4]:
fn <- function(x) ((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]
I then would like to use a minimization algorithm ie slsqp
from nloptr
to calculate the values in the Miss_C and Miss_D which ultimately satisfy this single value. The extra code for the optimization would be something like this:
x0 <- c(Miss_C,Miss_D)
heq <- function(x) (((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]) - 9.1 # heq == 0
slsqp(x0, fn, gr = NULL,
hin = NULL, heq = heq)
Obviously, that doesn't work as the initial values are not directly contained within the function that needs to be solved and that is the point I am stuck at! I'm not sure if this is an optimization problem as such or more of a general R coding question - either way, any help would be much appreciated.
Cheers, Andrew
* Edit - complete code added as per request *
#input variables
time_vector <- c(0,5,16,17,20)
prob_c <- c(1,0.977,0.835,0.5,0.30)
prob_d <- c(1,0.974,0.949,0.75,0.50)
miss_c <- c(0,0,0,0,0)
miss_d <- c(0,0,0,0,0)
#get number of rows
no_rows <- length(time_vector)
#fill events columns with dummy data
events_c <- c(0:(no_rows - 1))
events_d <- c(0:(no_rows - 1))
#define starting number at risk
risk_c_t0 <- 86
risk_d_t0 <- 78
#add t0 risk to each column
risk_c <- risk_c_t0
risk_d <-risk_d_t0
#fill risk columns with dummy data
risk_c[2:no_rows] <- c(2:no_rows)
risk_d[2:no_rows] <- c(2:no_rows)
#re-define values for events_c and risk_c with for loops
temp_events_c <-vector()
temp_risk_c <-vector()
for (i in 2:no_rows) {
temp_events_c <- ((prob_c[i] * risk_c[i-1]) - (prob_c[i] * miss_c[i-1]) - (prob_c[i-1] * risk_c[i-1]) + (prob_c[i-1] * miss_c[i-1])) / (prob_c[i] - (2 * prob_c[i-1]))
events_c[i] <- temp_events_c
temp_risk_c <- risk_c[i-1] - miss_c[i-1] - events_c[i]
risk_c[i] <- temp_risk_c
}
#re-define values for events_t with for loops
temp_events_d <-vector()
temp_risk_d <-vector()
for (j in 2:no_rows) {
temp_events_d <- ((prob_d[j] * risk_d[j-1]) - (prob_d[j] * miss_d[j-1]) - (prob_d[j-1] * risk_d[j-1]) + (prob_d[j-1] * miss_d[j-1])) / (prob_d[j] - (2 * prob_d[j-1]))
events_d[j] <- temp_events_d
temp_risk_d <- risk_d[j-1] - miss_d[j-1] - events_d[j]
risk_d[j] <- temp_risk_d
}
#calculate total risk, events and expected
total_risk <- risk_c + risk_d
total_events <- events_c + events_d
expected_c <- (risk_c * (total_events/total_risk))
expected_d <- (risk_d * (total_events/total_risk))
#place values into dataframe
df1 <- data.frame(time_vector,prob_c,prob_d, miss_c, miss_d, events_c, events_d, risk_c, risk_d, total_risk, total_events, expected_c, expected_d)
#sum of values
sum_events_C <- sum(events_c)
sum_events_d <- sum(events_d)
sum_expected_c <- sum(expected_c)
sum_expected_d <- sum(expected_d)
#chi_sq formula
chi_sq_combo <- (((sum_events_C - sum_expected_c)^2)/sum_expected_c) + (((sum_events_d - sum_expected_d)^2)/sum_expected_d)
#### end of table calculations before sim
#x <- c(sum_events_C, sum_expected_c, sum_events_d, sum_expected_d)
#x0 <- c(miss_c,miss_d) #inital values
#fn <- function(x) ((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]
#heq <- function(x) (((x[1]-x[2])^2)/x[2] + ((x[3]-x[4])^2)/x[4]) - 6.5 # heq == 0
#slsqp(x0, fn, gr = NULL,
# hin = NULL, heq = heq)
Rephrasing the comments above, I believe the problem is to use optimization to find two values which yield a target chi-square value. A complication which may cause problems is that there are likely many solutions that yield the target, so it might be necessary to add some other requirement to make the answer unique.
To do this, you need a function of two variables which calculates the square of the difference between the chi-square value using those variables and the target value, and then you minimize that.
For example,
fn2 <- function(x) {
c <- x[1]
d <- x[2]
chisq <- (((c - sum_expected_c)^2)/sum_expected_c) +
(((d - sum_expected_d)^2)/sum_expected_d)
(chisq - 6.5)^2
}
for (i in 1:no_rows) {
x0 <- c(df1$miss_c[i],df1$miss_d[i]) #initial values
res <- nloptr::slsqp(x0, fn2)
miss_c[i] <- res$par[1]
miss_d[i] <- res$par[2]
}
This gives the same values all 5 times, so I might not have understood you completely.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.