简体   繁体   中英

Search for specific interactions of the levels of two factors in R

I am looking for a way to search for a specific form of interaction between the levels of the factors constituting the rows of a dataframe.

I have a dataframe, such as this one, in which each column is an individual, and each row an observation:

     A   B   C   D   E   G   H   I  
1   NA  "1" "1" "1" "1" NA  "1" "1"
2   "2" "1" "2" "1" "1" NA  "1" "1"
3   "1" "2" "2" "1" "1" "1" "1" "2"
4   "1" "2" "2" "2" "3" "3" "4" "2"
5   "1" "1" "2" "2" "1" "2" "1" "2"

What I want to detect is the existence (or not) of combination of factor levels, such as for an x:x' and x:y' exists also a combination y:x' and y:y' . For instance here, such a combination exists for rows 2 and 3, wich I can see by using interaction or : :

> df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1", 
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1", 
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1", 
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1", 
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))
> interaction(df["2",],df["3",])
[1] 2.1  1.2  2.2  1.1  1.1  <NA> 1.1  1.2 
Levels: 1.1 2.1 1.2 2.2

as well as :

> as.factor(df["2",]):as.factor(df["3",])
[1] 2:1  1:2  2:2  1:1  1:1  <NA> 1:1  1:2 
Levels: 1:1 1:2 2:1 2:2

But, now, I would like the detection to be done automatically, so that I could put the labels of all the pairs of rows in the dataframe in which such a configuration (x:y, x:y', x':y, x':y') is detected into an edgelist for the network I want to draw afterwards (here, for instance, I would like to add a row "2","3" to the edgelist).

I have found an elaborate way to do that using Perl and regular expressions, but I wondered if there existed a way to do that in R, without using Regexp.

Edit [04/05/2013]

To avoid being unclear, here are more details about the configuration I'm looking for:

let {x,y,...} be observations of the first row
let {x',y',...} be observations of the second row
for interactions ({x,x'} and {x,y'}) does it exists interactions ({y,x'} and {y,y'})

So, to take a few examples, interactions such as:

1:1, 1:2, 2:1, 2:2 (rows 2 and 3)

or

1:1, **2:1**, **2:2**, **3:1**, **3:2**, 4:1 (rows 4 and 5)

would match, but not

1:1,1:2,1:3,1:4, 2:2 (rows 3 and 4)

or

1:1,1:2 (rows 1 and 2)

for instance.

What I have for now is a code that does what I want to do (imitated from a previous Perl script), in a tremendous amount of time (even if I add a while loop to avoid unnecessary comparisons), and using multiple loops and regexp. I was hoping for a less needlessly complicated way of doing this comparison. Here is how I do now:

df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1", 
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1", 
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1", 
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1", 
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))

"myfunction" = function(x){
    TableVariantes = as.matrix(x) ;
    #Creating the edgelist for the network
    edgelist = c(character(0),character(0)); 
    TotalVL = nrow(TableVariantes);

    for(i in 1:(TotalVL-1)){
        VLA = i;
        if(!(i+1) > TotalVL){
            for(j in (i+1):TotalVL){
                VLB = j ;
                problematic.configuration = FALSE;
                #False until proven otherwise
                interactions = interaction(as.factor(TableVariantes[VLA,]):as.factor(TableVariantes[VLB,]),drop=TRUE);
                if(nlevels(as.factor(interactions)) > 3){ 
                    #More than three configurations, let's go
                    #Testing every level of the first variant location
                    for(k in levels(as.factor(TableVariantes[VLA,]))){
                        # We create the regexp we will need afterwards. Impossible to use variables inside a regex in R.
                        searchforK = paste(k,":(.+)",sep="") 
                        if (length(grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE)) > 1){
                           #More than one configuration for this level of the first row
                           #capturing corresponding observations of the second row
                           second.numbers = regexec(searchforK,levels(interactions), ignore.case = TRUE)
                           second.numbers = do.call(rbind,lapply(regmatches(levels(interactions),second.numbers),`[`))
                           #Interactions with first number other than the one we are testing                           
                           invert.matches = grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE, value=TRUE, invert=TRUE)
                           #listing these alternative first numbers
                           alternative.first.numbers = regexec("(.+?):.+",levels(as.factor(invert.matches)), ignore.case = TRUE)
                           alternative.first.numbers = do.call(rbind,lapply(regmatches(levels(as.factor(invert.matches)),alternative.first.numbers),`[`))
                           #testing each alternative first number
                           for(l in levels(as.factor(alternative.first.numbers[,2]))){
                               #variable problems to count the problematic configurations
                               problems = 0 ;
                               #with each alternative second number
                               for(m in levels(as.factor(second.numbers[,2]))){
                                   searchforproblem = paste(l,":",m,sep="");
                                   if(length(grep(searchforproblem,invert.matches,ignore.case = TRUE, perl = TRUE)) > 0){
                                       #if it matches
                                       problems = problems + 1;
                                   }
                                   if(problems > 1){
                                       #If two possibilities at least
                                       problematic.configuration = TRUE;
                                   }
                               }
                           }
                        }
                    }
                }

            if(problematic.configuration == TRUE){
                edgelist = rbind(edgelist,c(rownames(TableVariantes)[VLA],rownames(TableVariantes)[VLB]));
                #adding a new edge to the network of conflicts !
            }
            }
        }
    }
    return(edgelist);
}

You can use the dput() function to provide example data with your question.

df <- structure(list(A = c("1", "2", "2", "1", "1", "1", NA, "2", "1", 
    "2"), B = c(NA, "2", "2", "2", "2", "1", "2", "2", "1", NA), 
    C = c("1", "2", "1", "1", NA, "1", NA, "2", "2", NA), D = c(NA, 
    NA, "2", "1", NA, "1", NA, "1", "1", NA), E = c(NA, NA, NA, 
    "2", "1", NA, "1", "2", NA, "1"), H = c(NA, NA, "1", "2", 
    NA, "1", "2", "2", NA, "1"), I = c(NA, NA, NA, NA, NA, NA, 
    "1", "1", NA, "2"), J = c("2", "1", "2", "1", "1", "2", NA, 
    "2", NA, "2"), K = c("1", "1", NA, "1", "2", "1", NA, "1", 
    "1", "1"), O = c("2", "2", "1", "2", "1", "1", NA, "2", "1", 
    NA)), .Names = c("A", "B", "C", "D", "E", "H", "I", "J", 
    "K", "O"), row.names = c(NA, -10L), class = "data.frame")

I assume that you are interested in discovering what pairs of observations (rows) have four unique interaction levels among the individuals (columns). Here is one way to work that out using for loops.

# convert your data frame to a matrix
m <- as.matrix(df)

# create another matrix to store the results
N <- dim(m)[1]
levelsmat <- matrix(NA, nrow=(N*N - N)/2, ncol=3, 
    dimnames=list(NULL, c("i", "j", "nlevels")))

# go through all possible pairs of observations
# and record the number of unique interactions
count <- 0
for(i in 1:(N-1)) {
for(j in (i+1):N) {
    count <- count + 1
    int <- interaction(m[i, ], m[j, ], drop=TRUE)
    levelsmat[count, ] <- c(i, j, length(levels(int)))
    }}

# paired observations that had 4 unique interactions
levelsmat[levelsmat[, "nlevels"]==4, ]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM