
How to write pairwise similarity to a file in R

I have a tab-separated file, string.tab, which contains something like the following:

Entry   String
Dog  cube;funny;smart
Cat  tiny;cube;black
....

I also have a function Sim(), from a package, that computes the similarity between two vectors of strings. For example:

# here is demo to show how Sim() works
a = c("cs", "funny")
b = c("math", "cool")
score = Sim(a,b)
# output: 0.156

The details of Sim() are not important. It's just a text mining tool.

Here is my code:

data <- read.table("string.tab", sep="\t", header=TRUE)

Now the contents of string.tab are stored in data.
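A minimal sketch of the loading step, assuming the file has exactly the two tab-separated columns shown above. To keep the example self-contained it first writes the two sample rows to a temporary file; `path` and `tokens` are names introduced here for illustration only.

```r
# Write the sample rows to a temporary file so the example runs on its own
path <- tempfile(fileext = ".tab")
writeLines(c("Entry\tString",
             "Dog\tcube;funny;smart",
             "Cat\ttiny;cube;black"), path)

# stringsAsFactors = FALSE keeps both columns as plain character vectors
data <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

# Split each String on ';' into a character vector, named by its Entry
tokens <- strsplit(data$String, ";", fixed = TRUE)
names(tokens) <- data$Entry

tokens[["Dog"]]
```

Each element of `tokens` is then in the form that Sim() takes in the demo above, a character vector of words.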

My goal is to compute the similarity for every pair of Entry values in string.tab.

The output file, result, should look something like:

Entry1 Entry2 Score
dog cat 0.132
...
...

What is a fast way to do that?

You can use combn(seq(nrow(df)), 2) to get all length-2 combinations (pairs) of row numbers. Then you can apply over these pairs a function that creates a data.frame for each pair, and rbind the results together.

You can then save the result as an R data file with saveRDS, as a CSV with write.csv, etc.

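# Convert factor columns to character so strsplit() can work on them
df[] <- lapply(df, as.character)

# Split each String on ';'. strsplit() itself returns a list, so each element
# of splits is a one-element list wrapping the character vector.
splits <- lapply(df$String, strsplit, ';')

# For every pair of row indices, build a one-row data.frame
pairs <- 
apply(utils::combn(seq(nrow(df)), 2), 2, function(x){
  data.frame(Entry1 = df$Entry[x[1]], Entry2 = df$Entry[x[2]], 
             Score = do.call(Sim, splits[x]))
})

# Stack the per-pair rows into a single data.frame
pairs.df <- do.call(rbind, pairs)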
pairs.df
#   Entry1 Entry2     Score
# 1    Dog    Cat 0.5791888
# 2    Dog  human 0.7434178
# 3    Cat  human 0.4850377

saveRDS(pairs.df, '/path/to/save/file.RDS')
#or
write.csv(pairs.df, '/path/to/save/file.csv')
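As a variation on the code above, the sketch below splits with strsplit() directly, so Sim() receives two plain character vectors for each pair. It then writes a tab-separated file in the Entry1/Entry2/Score layout the question asks for. The Sim() defined here is a stand-in (Jaccard overlap) so the example runs; it is an assumption, not the real Sim(). `idx` and `out_path` are hypothetical names.

```r
# Stand-in for the real Sim(): Jaccard overlap of two character vectors.
# This is an assumption for illustration only.
Sim <- function(x, y) length(intersect(x, y)) / length(union(x, y))

df <- data.frame(Entry  = c("Dog", "Cat", "human"),
                 String = c("cube;funny;smart", "tiny;cube;black", "man;women"),
                 stringsAsFactors = FALSE)

# One character vector per row
splits <- strsplit(df$String, ";", fixed = TRUE)

# 2 x choose(n, 2) matrix; each column is one pair of row indices
idx <- utils::combn(nrow(df), 2)

# Build the whole result in one data.frame instead of rbind-ing many small ones
pairs.df <- data.frame(
  Entry1 = df$Entry[idx[1, ]],
  Entry2 = df$Entry[idx[2, ]],
  Score  = mapply(function(i, j) Sim(splits[[i]], splits[[j]]),
                  idx[1, ], idx[2, ]),
  stringsAsFactors = FALSE
)

# Tab-separated output without row names or quotes
out_path <- tempfile(fileext = ".tab")
write.table(pairs.df, out_path, sep = "\t", row.names = FALSE, quote = FALSE)
```

Filling the columns as whole vectors avoids creating one data.frame per pair, which may matter when the number of entries is large. The time spent inside Sim() itself is likely to dominate either way.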

Data / Functions Used

df <- structure(list(Entry = structure(c(2L, 1L, 3L), .Label = c("Cat", "Dog", "human"), class = "factor"), String = structure(c(1L, 3L, 2L), .Label = c("cube;funny;smart", "man;women", "tiny;cube;black" ), class = "factor")), row.names = c(NA, 3L), class = "data.frame")


Sim <- function(x, y){ # example since I don't have real Sim
  set.seed(sum(nchar(x) + nchar(y)))
  runif(1)
}

The most straightforward approach would be to write a for loop.

I'm going to assume you have some method/function that makes a vector from an element of data$String. In this example I'll name the function extract().

l <- nrow(data)   # assumes at least two rows
n <- choose(l, 2) # number of pairs
entry <- as.character(data$Entry)
# stringsAsFactors = FALSE so that names can be assigned into the columns
result <- data.frame(Entry1 = rep("",n), Entry2 = rep("",n), Score = rep(0,n),
                     stringsAsFactors = FALSE)
# make combination data (row indices of each pair)
.comb <- data.frame(Entry1 = rep(0,n), Entry2 = rep(0,n))

# Entry1 list: 1 repeated (l-1) times, 2 repeated (l-2) times, ...
.comb$Entry1 <- unlist(mapply(FUN = rep, x = 1:l, times = (l-1):0))
# Entry2 list: 2:l, then 3:l, ...
.c <- c(2:l)
if(l>2){
  for(i in 3:l) {
    .c <- c(.c,i:l)
  }
}
.comb$Entry2 <- .c


for(i in seq_len(n)) {
   # look up names in the Entry column (data has no Entry1/Entry2 columns)
   result[i,"Entry1"] <- entry[.comb[i,"Entry1"]]
   result[i,"Entry2"] <- entry[.comb[i,"Entry2"]]
   e.1 <- extract(data$String[.comb[i,"Entry1"]])
   e.2 <- extract(data$String[.comb[i,"Entry2"]])
   result[i, "Score"] <- Sim(e.1,e.2)
}
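The loop above relies on extract(), which the answer leaves undefined. One possible definition, assuming the String column uses ';' as its separator as in the question:

```r
# Hypothetical helper: turn one ';'-separated value into a character vector.
# as.character() guards against the column having been read as a factor.
extract <- function(s) {
  strsplit(as.character(s), ";", fixed = TRUE)[[1]]
}

extract("cube;funny;smart")
```

With this helper, e.1 and e.2 in the loop are plain character vectors, matching the form Sim() takes in the question's demo.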
