Extract values from a transition matrix using columns of a data.frame in R
I have a transition matrix, with the cost of going from one state to another, e.g.:
cost <- data.frame( a=c("aa","ab"),b=c("ba","bb"))
(pretending that the string "aa" is the cost of moving from a to a)
I've got a data.frame with states in it:
transitions <- data.frame( from=c("a","a","b"), to=c("a","b","b") )
I'd like to be able to add a column to transitions holding the cost of each transition, so it ends up being:
from to cost
1 a a aa
2 a b ab
3 b b bb
I'm sure there is an R-ish way to do this. I've ended up using a for loop:
# data, col1, col2, dist and colName are defined elsewhere (not shown)
n <- dim(data)[1]
v <- vector("numeric", n)
for (i in 1:n) {
  z <- data[i, c(col1, col2), with = FALSE]
  za <- z[[col1]]
  zb <- z[[col2]]
  v[i] <- dist[za, zb]
}
data <- cbind(data, d = v)
names(data)[dim(data)[2]] <- colName
data
But this feels pretty ugly, and it's incredibly slow: it takes about 20 minutes on a 2M row data.frame (and an operation to compute distances between elements of the same table takes less than a second).
Is there a simple, fast, one- or two-line command that would get me the cost column above?
UPDATE: Consider known states
data.table solution:
require(utils)
require(data.table)
## Data generation
N <- 2e6
set.seed(1)
states <- c("a","b")
cost <- data.frame(a=c("aa","ab"),b=c("ba","bb"))
transitions <- data.frame(from=sample(states, N, replace=TRUE),
                          to=sample(states, N, replace=TRUE))
## Expanded cost matrix construction
f <- expand.grid(states, states)
f <- f[order(f$Var1, f$Var2),]
f$cost <- unlist(cost)
## Prepare data.table
dt <- data.table(transitions)
setkey(dt, from, to)
## Routine itself
dt[,cost:=as.character("")] # You don't need this line if cost is numeric
apply(f, 1, function(x) dt[J(x[1],x[2]),cost:=x[3]])
With 2M rows in transitions it takes about 0.3 seconds to run.
Here's one way (at least it works on this example, and I believe it will work on larger data as well; please write back with an example if it doesn't):
# load both cost and transition with stringsAsFactors = FALSE
# so that strings are NOT by default loaded as factors
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b"),
stringsAsFactors = FALSE)
# convert cost to vector: it'll have names a1, a2, b1, b2. we'll exploit that.
cost.vec <- unlist(cost)
# convert "to" to factor and create id column with "from" and as.integer(to)
# the as.integer(to) will convert it into its levels
transitions$to <- as.factor(transitions$to)
transitions$id <- paste0(transitions$from, as.integer(transitions$to))
# now you'll have a1, a2 etc. here as well; look each id up by name in the vector
# (indexing by name keeps the row order and copes with repeated transitions)
transitions$val <- unname(cost.vec[transitions$id])
# from to id val
# 1 a a a1 aa
# 2 a b a2 ab
# 3 b b b2 bb
You can of course remove the id column. If this doesn't work in some case, let me know and I'll try to fix it.
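The id column above encodes a (from, to) pair as a name. A minimal sketch of the same lookup without the intermediate id, assuming the cost table is held as a character matrix whose rows are the "to" state and whose columns are the "from" state (an orientation inferred from the question's expected output, not stated by the asker):

```r
# Assumption: rows = "to" state, columns = "from" state, so that going
# from a to b costs "ab" as in the question's expected output.
cost <- matrix(c("aa", "ab", "ba", "bb"), nrow = 2,
               dimnames = list(to = c("a", "b"), from = c("a", "b")))
transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

# Indexing a matrix with a two-column character matrix picks one element
# per row, matched against the dimnames (column 1 = row name,
# column 2 = column name).
transitions$cost <- cost[cbind(transitions$to, transitions$from)]
transitions
```

This is a single vectorised lookup, so it avoids both the row-by-row loop and any reliance on factor level order.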
Starting from Arun's answer, I went with:
library(reshape)
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb") )
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b") )
row.names(cost) <- c("a","b") #Normally get this from the csv file
# In the question's example the columns of cost are the "from" state and the
# rows are the "to" state (from a to b costs "ab"), so label the rows as "to"
cost$to <- row.names(cost)
m <- melt(cost, id.vars = "to")
m$transition <- paste(m$variable, m$to)
transitions$transition <- paste(transitions$from, transitions$to)
merge(m, transitions, by = "transition")
It's a few more lines, but I don't quite trust factor orderings as indexes. It also means that when they are data.tables, I can do:
setkey(m,transition)
setkey(transitions,transition)
m[transitions]
I haven't benchmarked, but on large datasets, I'm pretty confident the data.table merge will be faster than the merge or vector scan approaches.
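As a rough sketch of the same idea without the pasted transition key: data.table can join on the two state columns directly. The table name cost_long and its long layout are assumptions made for illustration; the values are the ones from the question, and the `on =` argument needs a reasonably recent data.table.

```r
library(data.table)

# Hypothetical long-format cost table: one row per (from, to) pair.
cost_long <- data.table(from = c("a", "a", "b", "b"),
                        to   = c("a", "b", "a", "b"),
                        cost = c("aa", "ab", "ba", "bb"))
transitions <- data.table(from = c("a", "a", "b"),
                          to   = c("a", "b", "b"))

# Update join: for every row of transitions that matches a row of
# cost_long on (from, to), copy that row's cost into a new column.
# The row order of transitions is left unchanged.
transitions[cost_long, on = .(from, to), cost := i.cost]
transitions
```

Because the column is added by reference, no copy of the 2M-row table is made, and pairs missing from cost_long would simply be left as NA.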