

Comparing multiple rows and creating a matrix in R or in Excel

I have a file containing multiple rows, as follows.

In file1:

a  8|2|3|4   4
b  2|3|5|6|7 5
c  8|5|6|7|9 5

a to a has 4 overlaps; similarly, a to b has 2 overlaps. To check the overlaps between the various entities, I need to generate a matrix from the above details. The output should be a matrix like this:

  a b c
a 4 2 1
b 2 5 3
c 1 3 5
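
As a quick check of the numbers in this matrix, each cell is the size of the intersection of two rows. A minimal sketch, with the three sets typed in by hand (the variable names and the helper `overlap` are illustrative, not from the original post):

```r
# Sets taken from the three rows of file1
a <- c(8, 2, 3, 4)
b <- c(2, 3, 5, 6, 7)
c_row <- c(8, 5, 6, 7, 9)   # named c_row to avoid masking base::c

# Hypothetical helper: number of distinct values two sets share
overlap <- function(x, y) length(intersect(x, y))

overlap(a, a)      # 4
overlap(a, b)      # 2 (values 2 and 3)
overlap(a, c_row)  # 1 (value 8)
overlap(b, c_row)  # 3 (values 5, 6 and 7)
```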

Please give me a suggestion on how to do this. Is there any way to do it using Excel, a shell script, or R? I have written the following code, but since I am not a good coder, I couldn't get the output printed in the right format.

Newmet <- sapply(newmet2, function(x) x[2:length(x)], simplify = FALSE)

for (i in 1:length(Newmet)) {
  for (j in 1:length(Newmet)) {
    # number of values shared by entries i and j
    common <- intersect(Newmet[[i]], Newmet[[j]])
    print(length(common))
  }
}

Edited: Thanks for all the answers. I got the matrix using both Excel and R with the help of the following answers.
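
The loop in the question prints each count on its own line; a labelled matrix comes from storing the counts instead of printing them. A minimal sketch, assuming a hypothetical named list `sets` that stands in for the list built in the question (`sets` and `overlap_mat` are illustrative names):

```r
# Hypothetical stand-in for the list of values per entity
sets <- list(a = c(8, 2, 3, 4),
             b = c(2, 3, 5, 6, 7),
             c = c(8, 5, 6, 7, 9))

n <- length(sets)

# Pre-allocate an n x n matrix labelled with the entity names
overlap_mat <- matrix(0L, nrow = n, ncol = n,
                      dimnames = list(names(sets), names(sets)))

# Fill each cell with the number of values shared by entities i and j
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    overlap_mat[i, j] <- length(intersect(sets[[i]], sets[[j]]))
  }
}

print(overlap_mat)
#   a b c
# a 4 2 1
# b 2 5 3
# c 1 3 5
```

Using `seq_len(n)` rather than `1:n` keeps the loops from running when the list is empty.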

Here is a function in R that returns the number of matches between each pair of columns as a new matrix.

First we get your data into an R data.frame object:

A <- c(8,2,3,4,NA)
B <- c(2,3,5,6,7)
C <- c(8,5,6,7,9)
dataset <- data.frame(A,B,C)

Then we create a function:

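count_matches <- function (x) {
  if (is.data.frame(x)) {
    y <- NULL
    for (i in 1:dim(x)[2]) {
      for (j in 1:dim(x)[2]) {
        # count the non-NA values of column i that also occur in column j
        count <- sum(x[[i]][!is.na(x[[i]])] %in% x[[j]][!is.na(x[[j]])])
        y <- c(y, count)
      }
    }
    # counts were collected row by row, so fill the matrix by row
    y <- matrix(y, dim(x)[2], byrow = TRUE)
    colnames(y) <- names(x)
    rownames(y) <- names(x)
    return(y)
  } else {
    print('Argument must be a data.frame')
  }
}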

We test the function on our dataset:

count_matches(dataset)
Which returns a matrix:

  A B C
A 4 2 1
B 2 5 3
C 1 3 5
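
The same pairwise counting can also be written without explicit loops. This is a sketch of an alternative, not the answer's own method: it uses nested `sapply` calls with `intersect`, so repeated values within a column are counted once, which differs from `%in%` when a column contains duplicates. The names `cols` and `overlap_mat` are illustrative.

```r
dataset <- data.frame(A = c(8, 2, 3, 4, NA),
                      B = c(2, 3, 5, 6, 7),
                      C = c(8, 5, 6, 7, 9))

# Drop NAs once per column and keep distinct values only
cols <- lapply(dataset, function(v) unique(v[!is.na(v)]))

# For every pair of columns, count the values they share;
# the result is a square matrix labelled with the column names
overlap_mat <- sapply(cols, function(x) {
  sapply(cols, function(y) length(intersect(x, y)))
})

overlap_mat
```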

If the numbers are in separate cells starting in Sheet1!A1, try


starting at Sheet2!A1.

Must be entered as an array formula using Ctrl+Shift+Enter.

Alternative formula that doesn't have to start at Sheet2!A1:



Using R:

# dummy data
df1 <- read.table(text = "a  8|2|3|4   4
b  2|3|5|6|7 5
c  8|5|6|7|9 5", as.is = TRUE)

#   V1        V2 V3
# 1  a   8|2|3|4  4
# 2  b 2|3|5|6|7  5
# 3  c 8|5|6|7|9  5

# split the 2nd column on "|" into a list of character vectors
myList <- unlist(lapply(df1$V2, strsplit, split = "|", fixed = TRUE), recursive = FALSE)
names(myList) <- df1$V1
# $a
# [1] "8" "2" "3" "4"
# $b
# [1] "2" "3" "5" "6" "7"
# $c
# [1] "8" "5" "6" "7" "9"

# get overlap counts
#    ind
# ind a b c
#   a 4 2 1
#   b 2 5 3
#   c 1 3 5
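
The call that produces the table above does not appear in the text. One base R idiom that is consistent with the `ind` labels in that output is `crossprod(table(stack(myList)))`; the sketch below is an assumption on that basis, not a confirmed reconstruction. It gives correct counts only when no value is repeated within a set.

```r
# Same list as built above, typed in directly so the block runs on its own
myList <- list(a = c("8", "2", "3", "4"),
               b = c("2", "3", "5", "6", "7"),
               c = c("8", "5", "6", "7", "9"))

# stack() returns a long data frame with columns `values` and `ind` (the list names)
long <- stack(myList)

# table() builds a values-by-ind incidence table; crossprod() computes t(X) %*% X,
# whose [i, j] entry counts the values shared by sets i and j
overlaps <- crossprod(table(long))

overlaps
```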

If we remove the data-processing part, this answer is already provided by a similar post: Intersect all possible combinations of list elements
