
Intersection on lists

I have 6 txt files split into 2 groups (A and T files). I want to import all these files into R, intersect every A file with every T file, and obtain a matrix comparing A against T (the number of items each pair shares), as in the example below. I was thinking of making two lists of vectors and finding a way to calculate this matrix from them.

A_1.txt
tomato
zucchini
potato
banana
coconut
salt
A_2.txt
tomato
zucchini
potato
A_3.txt
zucchini
potato
T_1.txt
tomato
zucchini
potato
banana
coconut
salt
T_2.txt
tomato
zucchini
potato
banana
T_3.txt
potato
banana
coconut

What I want to obtain is this matrix:

    T_1 T_2 T_3
A_1 6   4   3
A_2 3   3   1
A_3 2   2   1

Could somebody give me a tip on how to do this in R?

I read in the data this way:

A_files <- list.files("/home/A/", full.names = TRUE)
T_files <- list.files("/home/T/", full.names = TRUE)
myAlist <- lapply(A_files, read.delim, header=FALSE)
myTlist <- lapply(T_files, read.delim, header=FALSE)
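
Note that read.delim returns data frames, whereas set operations such as intersect work most naturally on plain character vectors. A minimal sketch of reading each file into a named character vector follows. The temporary folder, the two demo files, and the helper name read_items are assumptions made for illustration; they are not part of the original question.

```r
# Create a throwaway folder with two small demo files (hypothetical data).
demo_dir <- file.path(tempdir(), "A_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("tomato", "zucchini", "potato"), file.path(demo_dir, "A_1.txt"))
writeLines(c("zucchini", "potato"), file.path(demo_dir, "A_2.txt"))

# read_items() is a hypothetical helper: it returns one character vector
# per file, named after the file without its directory or extension.
read_items <- function(paths) {
  items <- lapply(paths, readLines)
  names(items) <- tools::file_path_sans_ext(basename(paths))
  items
}

A_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
myAlist <- read_items(A_files)
str(myAlist)
```

Naming the list elements at read time means the row and column labels of the final matrix can be taken directly from names(myAlist) and names(myTlist).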

This is what I would do with my preferred set of tools:

library(data.table)
library(magrittr)
filenames <- dir(pattern = "^[AT]_\\d\\.txt$")
vec <-
  lapply(filenames, fread, header = FALSE) %>% 
  set_names(filenames %>% stringr::str_remove("\\.txt$")) %>% 
  rbindlist(idcol = "file")
vecA <- vec[file %like% "^A"]
vecT <- vec[file %like% "^T"]
vecA[vecT, on = .(V1), allow.cartesian = TRUE] %>% 
  dcast(file ~ i.file, length)
   file T_1 T_2 T_3
1:  A_1   6   4   3
2:  A_2   3   3   1
3:  A_3   2   2   1

Explanation

  1. Assuming all files A_1.txt, A_2.txt, ..., T_2.txt, T_3.txt are stored in the same folder, the file names are picked up by pattern.
  2. All files are read into a list, the list elements are named accordingly, then they are combined into one data.table with an additional column which identifies the source of each row.
  3. Then the two datasets are separated into vecA and vecT. (This is just for clarity and to keep the code less convoluted.)
  4. The two datasets are joined on the item column and the result is reshaped from long to wide format, counting the number of common elements.

The result of the join is

vecA[vecT, on = .(V1), allow.cartesian = TRUE]
    file       V1 i.file
 1:  A_1   tomato    T_1
 2:  A_2   tomato    T_1
 3:  A_1 zucchini    T_1
 4:  A_2 zucchini    T_1
 5:  A_3 zucchini    T_1
 6:  A_1   potato    T_1
 7:  A_2   potato    T_1
 8:  A_3   potato    T_1
 9:  A_1   banana    T_1
10:  A_1  coconut    T_1
11:  A_1     salt    T_1
12:  A_1   tomato    T_2
13:  A_2   tomato    T_2
14:  A_1 zucchini    T_2
15:  A_2 zucchini    T_2
16:  A_3 zucchini    T_2
17:  A_1   potato    T_2
18:  A_2   potato    T_2
19:  A_3   potato    T_2
20:  A_1   banana    T_2
21:  A_1   potato    T_3
22:  A_2   potato    T_3
23:  A_3   potato    T_3
24:  A_1   banana    T_3
25:  A_1  coconut    T_3
    file       V1 i.file
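
For readers without data.table, the same join-then-count idea can be sketched in base R with merge and table. This is an illustrative equivalent, not the answer's own code. The names a_sets, t_sets and to_long are assumptions, and the data are typed in directly from the question.

```r
# Sample data from the question, as named lists of character vectors.
a_sets <- list(
  A_1 = c("tomato", "zucchini", "potato", "banana", "coconut", "salt"),
  A_2 = c("tomato", "zucchini", "potato"),
  A_3 = c("zucchini", "potato")
)
t_sets <- list(
  T_1 = c("tomato", "zucchini", "potato", "banana", "coconut", "salt"),
  T_2 = c("tomato", "zucchini", "potato", "banana"),
  T_3 = c("potato", "banana", "coconut")
)

# to_long() is a hypothetical helper: one row per (file, item) pair.
to_long <- function(sets) {
  data.frame(
    file = rep(names(sets), lengths(sets)),
    item = unlist(sets, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Inner join on the item column; each row is one item shared by an A/T pair.
joined <- merge(to_long(a_sets), to_long(t_sets),
                by = "item", suffixes = c("_A", "_T"))

# Cross-tabulate the pairs to count the shared items.
counts <- table(A = joined$file_A, T = joined$file_T)
counts
```

Two caveats apply to this sketch: an item listed twice in one file would be counted twice by the join, and a file that shares nothing with any file of the other group would not appear in the table at all.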

Reproducible data

This is a way to create the 6 input files from the sample dataset provided in the question:

library(data.table)
library(magrittr)
fread("A_1.txt
tomato
zucchini
potato
banana
coconut
salt
A_2.txt
tomato
zucchini
potato
A_3.txt
zucchini
potato
T_1.txt
tomato
zucchini
potato
banana
coconut
salt
T_2.txt
tomato
zucchini
potato
banana
T_3.txt
potato
banana
coconut", header = FALSE) %>% 
  .[, fwrite(.(V1[-1]), V1[1]), by = cumsum(V1 %like% "^[AT]_\\d\\.txt$")]
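
The grouping trick in the last line (a running count of header lines assigns every row to the file it belongs to) can be seen in isolation in base R. The following is a small sketch on a shortened, made-up input; the variable names are assumptions, and nothing is written to disk.

```r
# A shortened version of the pasted text: file names followed by their items.
lines <- c("A_1.txt", "tomato", "zucchini",
           "A_2.txt", "potato",
           "T_1.txt", "banana", "coconut")

# TRUE for lines that look like a file name, FALSE for item lines.
is_header <- grepl("^[AT]_[0-9]\\.txt$", lines)

# cumsum() increases by one at every header, so each block gets its own id.
group <- cumsum(is_header)

# Split into blocks, then separate the header (first element) from the items.
blocks <- split(lines, group)
sets <- lapply(blocks, function(b) b[-1])
names(sets) <- vapply(blocks, function(b) b[1], character(1))
str(sets)
```

Each element of sets could then be written out with writeLines(sets[[i]], names(sets)[i]) to create the input files.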

Here is an approach using base R commands. Before R 4.0.0, R converted character columns to factors by default when reading files, and that must be avoided here. Passing as.is=TRUE to read.delim (or read.csv) keeps the data as character. First make the data easily available:

myAlist <- list(A_1 = c("tomato", "zucchini", "potato", "banana", "coconut", 
     "salt"), A_2 = c("tomato", "zucchini", "potato"), A_3 = c("zucchini", 
     "potato"))
myTlist <- list(T_1 = c("tomato", "zucchini", "potato", "banana", "coconut", 
     "salt"), T_2 = c("tomato", "zucchini", "potato", "banana"), T_3 = c("potato", 
     "banana", "coconut"))

Now we create a function to find the intersection of two groups and compute the number of shared items:

Shared <- function(a, t) {
    length(intersect(myAlist[[a]], myTlist[[t]]))
}
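
One property of intersect is worth knowing here: it treats its arguments as sets, so repeated entries are counted once. A short sketch with made-up vectors (a_items and t_items are illustrative names) shows this:

```r
# Made-up vectors containing repeated entries.
a_items <- c("potato", "potato", "banana")
t_items <- c("potato", "banana", "banana", "salt")

# intersect() returns each common value once, regardless of repetitions.
shared <- intersect(a_items, t_items)
shared
length(shared)
```

So the count returned by Shared is the number of distinct items the two files have in common.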

We compare each group in A with each group in T, e.g. A_1 with T_1, T_2, T_3, and so on. The two index vectors below enumerate all nine pairs:

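# The index vectors are named a_idx and t_idx because assigning to T
# would mask R's built-in shorthand for TRUE.
(a_idx <- rep(1:3, each=3))
# [1] 1 1 1 2 2 2 3 3 3
(t_idx <- rep(1:3, 3))
# [1] 1 2 3 1 2 3 1 2 3

Finally we compute the number of shared items:

nshare <- mapply(Shared, a_idx, t_idx)
myTbl <- matrix(nshare, 3, byrow=TRUE, dimnames=list(A=names(myAlist), T=names(myTlist)))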
myTbl
#      T
# A     T_1 T_2 T_3
#   A_1   6   4   3
#   A_2   3   3   1
#   A_3   2   2   1
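
As a possible variation, the index vectors can be avoided by nesting two sapply calls, which also removes the hard-coded 3 and the dependence on global lists inside the counting function. This is a sketch under the same sample data; a_sets, t_sets and count_shared are illustrative names, not part of the answer above.

```r
# Sample data from the question, as named lists of character vectors.
a_sets <- list(
  A_1 = c("tomato", "zucchini", "potato", "banana", "coconut", "salt"),
  A_2 = c("tomato", "zucchini", "potato"),
  A_3 = c("zucchini", "potato")
)
t_sets <- list(
  T_1 = c("tomato", "zucchini", "potato", "banana", "coconut", "salt"),
  T_2 = c("tomato", "zucchini", "potato", "banana"),
  T_3 = c("potato", "banana", "coconut")
)

# Number of distinct items two character vectors have in common.
count_shared <- function(a_items, t_items) {
  length(intersect(a_items, t_items))
}

# The outer sapply builds one column per T set; the inner sapply fills
# that column with one count per A set.
shared <- sapply(t_sets, function(t_items) {
  sapply(a_sets, count_shared, t_items = t_items)
})
shared
```

Because the row and column names come from the list names, the same code works unchanged for any number of A and T files.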
