How to write pairwise similarity to a file in R
I have a tab-delimited file, string.tab, which contains something like the following:
Entry String
Dog cube;funny;smart
Cat tiny;cube;black
....
I also have a function, Sim(), that computes the similarity between two vectors of strings. For example:
# a demo showing how Sim() works
a = c("cs", "funny")
b = c("math", "cool")
score = Sim(a,b)
# output: 0.156
The details of Sim() are not important. It's just a text mining tool.
Here is my code:
data <- read.table("string.tab", sep="\t", header=TRUE)
Now string.tab is stored in data.
My goal is to compute the similarity for every pair of entries in string.tab.
The output file, result, should look something like:
Entry1 Entry2 Score
dog cat 0.132
...
...
What is a fast way to do that?
You can use combn(seq(nrow(df)), 2) to get all length-2 combinations (pairs) of row numbers, where df is the data frame holding the table. Then you can apply over these pairs a function that creates a data.frame for each pair, and rbind the results together.
You can then save the result as an R data file with saveRDS, as a CSV with write.csv, etc.
df[] <- lapply(df, as.character)
splits <- lapply(df$String, strsplit, ';')
pairs <-
  apply(utils::combn(seq(nrow(df)), 2), 2, function(x){
    data.frame(Entry1 = df$Entry[x[1]], Entry2 = df$Entry[x[2]],
               Score = do.call(Sim, splits[x]))
  })
pairs.df <- do.call(rbind, pairs)
pairs.df
# Entry1 Entry2 Score
# 1 Dog Cat 0.5791888
# 2 Dog human 0.7434178
# 3 Cat human 0.4850377
saveRDS(pairs.df, '/path/to/save/file.RDS')
#or
write.csv(pairs.df, '/path/to/save/file.csv')
Data / Functions Used
df <- structure(list(Entry = structure(c(2L, 1L, 3L), .Label = c("Cat", "Dog", "human"), class = "factor"), String = structure(c(1L, 3L, 2L), .Label = c("cube;funny;smart", "man;women", "tiny;cube;black" ), class = "factor")), row.names = c(NA, 3L), class = "data.frame")
Sim <- function(x, y){ # example since I don't have real Sim
  set.seed(sum(nchar(x) + nchar(y)))
  runif(1)
}
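As a rough sketch of the same idea, the result can also be assembled column by column from the index matrix that combn returns, rather than building one data.frame per pair, and then written as a tab-separated file like the one the question asks for. Everything named here is an assumption for illustration: sim_demo is a hypothetical stand-in for the real Sim() (a simple Jaccard overlap), the small df mirrors the example data, and out_file is a temporary path.

```r
# Hypothetical stand-in for Sim(): Jaccard overlap of two character vectors.
sim_demo <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Small example table with the same columns as string.tab.
df <- data.frame(Entry  = c("Dog", "Cat", "human"),
                 String = c("cube;funny;smart", "tiny;cube;black", "man;women"),
                 stringsAsFactors = FALSE)

# One plain character vector of terms per row.
terms <- strsplit(df$String, ";", fixed = TRUE)

# 2 x n matrix of row-index pairs; each column is one pair.
idx <- utils::combn(nrow(df), 2)

# Build every column at once from the index matrix.
pairs_df <- data.frame(
  Entry1 = df$Entry[idx[1, ]],
  Entry2 = df$Entry[idx[2, ]],
  Score  = mapply(function(i, j) sim_demo(terms[[i]], terms[[j]]),
                  idx[1, ], idx[2, ]),
  stringsAsFactors = FALSE
)

# Tab-separated output without row names or quotes (temporary file, assumed path).
out_file <- tempfile(fileext = ".tab")
write.table(pairs_df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Swapping sim_demo for the real Sim() should be the only change needed, provided Sim() accepts two character vectors as in the question's demo.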
The most straightforward approach would be to write a for loop.
I'm going to assume you have some method/function that makes a vector from an element of data$String. In this example I'll name the function extract().
l <- nrow(data)
n <- choose(l, 2) # number of pairs
# stringsAsFactors = FALSE so the name columns can be filled with strings
result <- data.frame(Entry1 = rep("", n), Entry2 = rep("", n), Score = rep(0, n),
                     stringsAsFactors = FALSE)
# row indices for every pair
.comb <- data.frame(Entry1 = rep(0, n), Entry2 = rep(0, n))
# first index of each pair: 1 repeated l-1 times, 2 repeated l-2 times, ...
.comb$Entry1 <- unlist(mapply(FUN = rep, x = 1:l, times = (l-1):0))
# second index of each pair: 2:l, then 3:l, ..., then l
.c <- c(2:l)
if (l > 2) {
  for (i in 3:l) {
    .c <- c(.c, i:l)
  }
}
.comb$Entry2 <- .c
for (i in 1:n) {
  # names come from data$Entry; data has no Entry1/Entry2 columns
  result[i, "Entry1"] <- as.character(data$Entry[.comb[i, "Entry1"]])
  result[i, "Entry2"] <- as.character(data$Entry[.comb[i, "Entry2"]])
  e.1 <- extract(data$String[.comb[i, "Entry1"]])
  e.2 <- extract(data$String[.comb[i, "Entry2"]])
  result[i, "Score"] <- Sim(e.1, e.2)
}
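The loop above relies on extract(), which is left undefined. One possible version, assuming each String value is a semicolon-separated list as in the question's sample data, might look like this (the name and behavior are illustrative, not part of any package):

```r
# Hypothetical extract(): split one semicolon-separated value into a character vector.
extract <- function(s) {
  # as.character() guards against factor columns, which read.table
  # produced by default before R 4.0.0.
  strsplit(as.character(s), ";", fixed = TRUE)[[1]]
}

extract("cube;funny;smart")
```

With the sample row for Dog, this would hand Sim() the vector of the three terms "cube", "funny" and "smart".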