I am writing a piece of R code and have gotten stuck.
Background (not necessary for solving the problem): I am calculating a joint probability by multiplying independent marginal distributions. The marginal probability vectors are generated iteratively by ProbGenerationProcess(). Each iteration outputs one named vector, e.g.
Iteration 1:
Color =
Blue Green
0.2 0.8
Iteration 2:
Material =
Cotton Silk
0.7 0.3
Iteration 3:
Country =
China USA
0.6 0.4
......
Desired result: I want the resulting joint probability table to contain, for every combination of values, the product of the corresponding elements of the marginal vectors. The format should look like this:
Color Material Country Prob
Blue Cotton China 0.084 (= 0.2*0.7*0.6)
Blue Cotton USA 0.056 (= 0.2*0.7*0.4)
Blue Silk China 0.036 (= 0.2*0.3*0.6)
Blue Silk USA ..
Green Cotton China ..
Green Cotton USA ..
... ... ... ...
My Implementation: Here's my code:
joint.names = NULL # data frame to store the marginal value names
joint.probs = NULL # store probabilities
for (i in iterations) {
    marginal = ProbGenerationProcess(VarUniqueToIteration) # output is numeric with names
    if ( is.null(joint.names) ) {
        # initialize the data frames
        joint.names = names(marginal)
        joint.probs = marginal
    } else {
        # (my hope:) iteratively populate joint.names and joint.probs
        joint.names = expand.grid(joint.names, names(marginal))
        expanded.prob = expand.grid(joint.probs, marginal)
        joint.probs = expanded.prob$Var1 * expanded.prob$Var2 # row-by-row multiplication
    }
}
Output: joint.probs always turns out to be correct. However, joint.names doesn't work the way I wanted. After the first two iterations everything looks fine. I get:
joint.names =
Var1 Var2
1 Blue Cotton
2 Green Cotton
3 Blue Silk
4 Green Silk
... ...
Starting from the third iteration it becomes problematic:
joint.names =
Var1.Var1 Var1.Var2 Var1.Var1.1 Var1.Var2.1 Var2
1 Blue Cotton Blue Cotton China
2 Green Cotton Green Cotton China
3 Blue Silk Blue Silk USA
4 Green Silk Green Silk USA
I guess my first question is: is this the most efficient way to get the result I wanted? If so, is expand.grid() the function I should be using, and how should I initialize it correctly?
Any help is appreciated!
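One way to sidestep the nested column names is to avoid feeding a data frame back into expand.grid() at all: collect the marginal vectors in a named list first, then build the name grid with a single call. The sketch below is only an illustration under that assumption; the `marginals` list is a hypothetical stand-in for the vectors returned by ProbGenerationProcess(), and the Reduce()/outer() step is one possible way to get the products in the same row order.

```r
# Hypothetical stand-ins for the named vectors returned by ProbGenerationProcess()
marginals <- list(
  Color    = c(Blue = 0.2, Green = 0.8),
  Material = c(Cotton = 0.7, Silk = 0.3),
  Country  = c(China = 0.6, USA = 0.4)
)

# One call builds every combination of names; the first factor varies fastest
joint <- expand.grid(lapply(marginals, names), stringsAsFactors = FALSE)

# as.vector(outer(a, b)) also varies `a` fastest, so the products line up
# with the rows produced by expand.grid()
joint$Prob <- Reduce(function(acc, m) as.vector(outer(acc, m)), marginals)

joint
```

Inside a loop, the same idea would be to append each new vector to the list (for example `marginals[[var.name]] <- marginal`, where `var.name` is a hypothetical label) and build the grid once after the loop finishes.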
merge() is your friend: merging data frames that share no column names returns their Cartesian product.
color <- data.frame(color=c("blue","green"),prob=c(0.2,0.8))
material <- data.frame(material=c("cotton","silk"),prob=c(0.7,0.3))
country <- data.frame(country=c("china","usa"),prob=c(0.6,0.4))
dat <- merge(merge(color[1],material[1]),country[1]) # get names first
# same as: expand.grid(c("china","usa"),c("cotton","silk"),c("blue","green"))
dat <- merge(dat, color, by="color")
dat <- merge(dat, material, by="material")
dat <- merge(dat, country, by="country")
dat$joint <- dat$prob.x * dat$prob.y * dat$prob # joint calc
dat <- dat[-grep("^prob",colnames(dat))] # cleanup extra probs
Result:
country material color joint
1 china cotton blue 0.084
2 china cotton green 0.336
3 china silk blue 0.036
4 china silk green 0.144
5 usa cotton blue 0.056
6 usa cotton green 0.224
7 usa silk blue 0.024
8 usa silk green 0.096
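If the marginals arrive one at a time, as in the question, the same merge idea can be folded over a list so that the running probability is updated at each step. This is a minimal sketch rather than a tuned implementation; `cross_multiply` is a hypothetical helper name, and it assumes every input data frame keeps its probabilities in a column called `prob`.

```r
color    <- data.frame(color = c("blue", "green"), prob = c(0.2, 0.8))
material <- data.frame(material = c("cotton", "silk"), prob = c(0.7, 0.3))
country  <- data.frame(country = c("china", "usa"), prob = c(0.6, 0.4))

# Hypothetical helper: cross join two tables and multiply their prob columns
cross_multiply <- function(acc, nxt) {
  # by = NULL forces a Cartesian product; the clashing `prob` columns
  # come back as prob.x and prob.y
  out <- merge(acc, nxt, by = NULL)
  out$prob <- out$prob.x * out$prob.y
  out[, !(names(out) %in% c("prob.x", "prob.y")), drop = FALSE]
}

dat <- Reduce(cross_multiply, list(color, material, country))
dat
```

Because the helper only ever sees two tables, it works unchanged for any number of marginals.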
How about this for simplicity (although if performance is an issue, merge may be the better choice):
PROBS<-data.frame(Item=rep(c("Color","Material","Country"),each=2),
Value=c("Blue","Green","Cotton","Silk","China","USA"),
Prob=c(0.2,0.8,0.7,0.3,0.6,0.4))
rownames(PROBS)<-PROBS$Value
GRID<-expand.grid(by(PROBS,PROBS$Item,function(x)x["Value"]))
GRID$probs<-apply(GRID,1,function(x)prod(PROBS[c(x),"Prob"]))
GRID
# Color Country Material probs
#1 Blue China Cotton 0.084
#2 Green China Cotton 0.336
#3 Blue USA Cotton 0.056
#4 Green USA Cotton 0.224
#5 Blue China Silk 0.036
#6 Green China Silk 0.144
#7 Blue USA Silk 0.024
#8 Green USA Silk 0.096
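If the row-by-row apply() turns out to be the slow part, the lookup can be done one column at a time instead. The following is a rough sketch under the same assumption as above, namely that every value name is unique across items; `lookup` is a hypothetical helper vector, and split() is used here in place of by() to build the list of values.

```r
PROBS <- data.frame(Item  = rep(c("Color", "Material", "Country"), each = 2),
                    Value = c("Blue", "Green", "Cotton", "Silk", "China", "USA"),
                    Prob  = c(0.2, 0.8, 0.7, 0.3, 0.6, 0.4),
                    stringsAsFactors = FALSE)

# One character vector of values per item, crossed into a grid of names
GRID <- expand.grid(split(PROBS$Value, PROBS$Item), stringsAsFactors = FALSE)

# Named vector mapping each value to its probability (values must be unique)
lookup <- setNames(PROBS$Prob, PROBS$Value)

# Look up each column in one vectorized step, then multiply the columns elementwise
GRID$probs <- Reduce(`*`, lapply(GRID, function(col) unname(lookup[col])))

GRID
```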