How to compare two columns in different data.frames within R

I am working on my first real project in R and have run into a problem. I am trying to compare two columns from two different data.frames. I tried running this code:

matrix1 = matrix
for (i in 1:2000){
  if(data.QW[i,1] == data.RS[i,1]){
    matrix1[i,1]== "True"
  }
  else{
    matrix1[i,1]== "False"
  }
}

I got this error:

Error in Ops.factor(data.QW[i,1], data.RS[i,1]) : 
  level sets of factors are different

I think this may be because QW and RS have different numbers of rows. I am trying to find where the two data.frames disagree so that I can fix them according to the source document.

I am also unsure if matrix will work for this or if I need to make it into a vector and rbind it into the matrix every time.

Any good readings on this would also be appreciated.

As mentioned in the comments, providing a reproducible example with the contents of the dataframe will be helpful.

Going by the question title, it appears that you want to compare column 1 of data frame A against column 1 of data frame B and store the result in a logical vector. If that summary is accurate, please take a look here.
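
A minimal sketch of that reading, using two small made-up data frames (qw and rs are hypothetical stand-ins for data.QW and data.RS, and the values are invented for illustration). The columns are converted to character first so that the comparison does not depend on factor levels, and only the rows that both data frames have by position are compared:

```r
# Hypothetical stand-ins for data.QW and data.RS; the values are made up.
qw <- data.frame(id = c("a1", "b2", "c3", "d4"), stringsAsFactors = TRUE)
rs <- data.frame(id = c("a1", "bX", "c3"), stringsAsFactors = TRUE)

# Compare only the rows both data frames have, position by position.
n <- min(nrow(qw), nrow(rs))
same <- as.character(qw[seq_len(n), 1]) == as.character(rs[seq_len(n), 1])

same          # logical vector, one entry per compared row
which(!same)  # row numbers where the two columns disagree
```

Rows beyond the shorter data frame are left out here on purpose; whether they should count as mismatches depends on the data.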

Too long for a comment.

Some observations:

  1. Your columns, data.QW[,1] and data.RS[,1] are almost certainly factors.
  2. The factors almost certainly have different sets of levels (it's possible that one of the factors has a subset of the levels in the other factor). When this happens, comparisons using == will not work.
  3. If you read your data into these data.frames using something like read.csv(...), any columns containing character data were converted to factors by default (this is the default in R versions before 4.0.0). You can change that behavior by setting stringsAsFactors=FALSE in the call to read.csv(...). This is a very common problem.
  4. Once you've sorted out the factors/levels problem, you can avoid the loop by simply using data.QW[1:2000,1]==data.RS[1:2000,1]. This creates a logical vector of length 2000 containing all the comparisons. Of course, this assumes that both data.frames have at least 2000 rows.

Here's an example of item 2:

x <- as.factor(rep(LETTERS[1:5],3))   # has levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3],5))   # has levels: A, B, C
y==x
# Error in Ops.factor(y, x) : level sets of factors are different
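
Continuing with x and y from the example above, here is a sketch of two common ways around the error: compare the values as character strings, or rebuild both factors on a shared set of levels. The names lv, x2, y2, eq_chr and eq_fac are introduced here only for illustration.

```r
x <- as.factor(rep(LETTERS[1:5], 3))   # levels: A, B, C, D, E
y <- as.factor(rep(LETTERS[1:3], 5))   # levels: A, B, C

# Option 1: compare the labels as character strings.
eq_chr <- as.character(y) == as.character(x)

# Option 2: give both factors the same level set, then compare.
lv <- union(levels(x), levels(y))
x2 <- factor(x, levels = lv)
y2 <- factor(y, levels = lv)
eq_fac <- y2 == x2

all(eq_chr == eq_fac)   # the two options agree element by element
```

Option 1 is usually the simpler choice when the factor structure is not needed afterwards; option 2 keeps the columns as factors.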

The function compare below compares data.frames or matrices a and b to find row matches of a in b. It returns the first row position in b that matches (after some internal sorting required to speed things up). Rows in a that have no match in b get a return value of 0. It should handle numeric, character and factor column types and mixtures thereof (the latter for data.frames only). See the example below the function definition.

compare<-function(a,b){

    #################################################
    if(dim(a)[2]!=dim(b)[2]){
        stop("\n Matrices a and b have different number of columns!")
    }
    if(!all(sapply(a, class)==sapply(b, class))){
        stop("\n Matrices a and b have incomparable column data types!")    
    }
    #################################################
    if(is.data.frame(a)){
        i <- sapply(a, is.factor)
        a[i] <- lapply(a[i], as.character)
    }
    if(is.data.frame(b)){
        i <- sapply(b, is.factor)
        b[i] <- lapply(b[i], as.character)
    }
    len1<-dim(a)[1]
    len2<-dim(b)[1]
    ord1<-do.call(order,as.data.frame(a))
    a<-a[ord1,]
    ord2<-do.call(order,as.data.frame(b))
    b<-b[ord2,]     
    #################################################
    found<-rep(0,len1)  
    dims<-dim(a)[2]
    do_dims<-c(1:dim(a)[2]) 
    at<-1
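    for(i in 1:len1){
        # advance through sorted b until its current row no longer sorts before row i of a
        repeat{
            if(at>len2){break}
            cmp<-0 # -1: b row sorts before a row, 1: after, 0: equal in all columns
            for(m in do_dims){
                if(b[at,m]<a[i,m]){cmp<-(-1);break}
                if(b[at,m]>a[i,m]){cmp<-1;break}
            }
            if(cmp<0){at<-(at+1)}else{break}
        }
        if(at>len2){break}
        if(cmp==0){found[i]<-at}
    }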
    #################################################
    found<-found[order(ord1)]
    hit<-found>0
    found[hit]<-ord2[found[hit]] # indexing with 0 would silently drop the unmatched rows
    return(found)

}
# example data sets:
ncols<-10
nrows<-1E4
a <- matrix(sample(LETTERS,size = (ncols*nrows), replace = TRUE), ncol = ncols, nrow = nrows)
b <- matrix(sample(LETTERS,size = (ncols*nrows), replace = TRUE), ncol = ncols, nrow = nrows)
b <- rbind(a,b) # example of b containing a
b <- b[sample(dim(b)[1],dim(b)[1],replace = FALSE),]
found<-compare(a,b)

a<-as.data.frame(a,stringsAsFactors=TRUE) # = conversion to factors
b<-as.data.frame(b,stringsAsFactors=TRUE) # = conversion to factors
found<-compare(a,b)
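
For cross-checking results like these, here is a sketch of a much shorter alternative built on base R's match(). It is not the function above: it pastes each row into a single key string and looks the keys up, under the assumption that no cell contains the separator character. The names row_key and match_rows, and the small a_small and b_small data sets, are made up for illustration.

```r
# Build one key string per row; assumes no cell contains the separator.
row_key <- function(x, sep = "\r") {
    cols <- unname(as.list(as.data.frame(x, stringsAsFactors = FALSE)))
    do.call(paste, c(cols, sep = sep))
}

# First row position in b matching each row of a, 0 where there is no match.
match_rows <- function(a, b) {
    match(row_key(a), row_key(b), nomatch = 0L)
}

a_small <- matrix(c("A", "B", "C",
                    "X", "Y", "Z"), ncol = 2)        # rows: A X, B Y, C Z
b_small <- matrix(c("C", "A", "Q", "A",
                    "Z", "X", "Q", "X"), ncol = 2)   # rows: C Z, A X, Q Q, A X
match_rows(a_small, b_small)
# 2 0 1
```

Pasting rows into strings trades some memory for simplicity, and numeric columns are compared by their printed form, so this is best treated as a sanity check rather than a drop-in replacement.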
