
Sorting and removing duplicates in an ffdf data frame in R


This is another question about ff data frames, hopefully my last. I have an awkward problem, but first here is the code:

fda<-data.frame(c(1,2),c(3,4))
colnames(fda)<-c("col1","col2")
fdb<-data.frame(c(3,7),c(1,5))
colnames(fdb)<-c("col1","col2")
fd<-rbind(fda,fdb)
fd<-data.frame(fd)
library(ff)
library(ffbase)
fd<-as.ffdf(fd)
a<-c(10,12,13,11)
b<-c(13,15,10,14)
fd$col3<-as.ff(a)
fd$col4<-as.ff(b)
fd

The table looks like this:

col1 col2 col3 col4
   1    3   10   13
   2    4   12   15
   3    1   13   10
   7    5   11   14

The code below removes any rows that are exact duplicates:

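rm(fda)
rm(fdb)
# flag rows that exactly repeat an earlier row
fd$dup<-duplicated.ffdf(fd)
# keep only the rows that were not flagged
fdfin<-subset.ffdf(fd, fd$dup == "FALSE")
# drop the helper column (column 5) again
fdfin<-as.ffdf(fdfin[,-5])
fdfin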

Rows 1 and 3 are effectively duplicates: they hold the same values, just in a different order within the row. I need to sort the values within each row so that such rows match, and then apply the duplicate-removal code above (or some alternative) to remove just one of row 1 or row 3.

This is a small sample of a ~12,000,000 row table so I will need to do this with the ff or ffbase packages.

The following works on a normal data frame, so I was wondering how I could use ff functions to do the same thing:

df<-temp1[,1:2] #temp1 is a data frame
df.sort<-t(apply(df,1,sort))
sortdup<-temp1[!duplicated(df.sort),]

Let me know if there is any more information that is needed,
Cheers,
Lorcan

Look at ?ffrowapply in the ff package to do the sorting over the rows (the apply(df,1,sort) step).
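When only two key columns are involved (col1 and col2 in the question), sorting within each row is equivalent to taking the pairwise minimum and maximum, so the row-wise apply can be avoided. The sketch below shows this in base R on the sample data as a plain data frame. The names lo, hi and dup are illustrative, and it does not assume that pmin/pmax accept ff vectors directly; on an ffdf the same step could be run on each chunk once it is in RAM.

```r
# Sample data from the question, as a plain data frame.
fd <- data.frame(col1 = c(1, 2, 3, 7),
                 col2 = c(3, 4, 1, 5),
                 col3 = c(10, 12, 13, 11),
                 col4 = c(13, 15, 10, 14))

# With two key columns, sorting within a row is the same as taking
# the pairwise minimum and maximum, which is fully vectorised.
lo <- pmin(fd$col1, fd$col2)
hi <- pmax(fd$col1, fd$col2)

# Flag rows whose order-normalised key pair has already been seen.
dup <- duplicated(data.frame(lo, hi))

# Keep the first occurrence of each pair (row 3 is dropped, row 1 kept).
sortdup <- fd[!dup, ]
```

This only covers the two-column case; with more key columns a row-wise sort is still needed.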

The sortdup<-temp1[!duplicated(df.sort),] step will work in version 0.6 of the ffbase package, as it allows indexing with an ff vector of logicals.

To install version 0.6 of ffbase, run the following code.

download.file(url="http://fffunctions.googlecode.com/git-history/b6fa5617810e012e5d809d77a9a99dbb25c7e6dc/output/ffbase_0.6.tar.gz", destfile="ffbase_0.6.tar.gz")
install.packages("ffbase_0.6.tar.gz", repos=NULL)

If ?ffapply does not work, you can always use this type of function for the apply part of your question:

ffdfapply <- function(X, MARGIN, FUN, ...){
    ## Currently only handles return types ffdf and ff_vector
    stopifnot(is.ffdf(X))
    xchunks <- chunk(X)
    result <- NULL
    if(MARGIN==1){
        for (i in xchunks){
            res.chunk <- apply(X[i, ], MARGIN=1, FUN=FUN, ...)
            if(is.data.frame(res.chunk)){
                result <- ffdfappend(result, res.chunk)
            }else{
                result <- ffappend(result, res.chunk)
            }
        }
    }else{
        stop("only MARGIN=1 currently allowed")
    }
    result
}
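To illustrate the chunking idea without ff, here is a minimal base-R sketch of the same pattern: process a block of rows at a time, sort within each row, and stack the results. The helper name row_sort_chunked and its chunk_size argument are hypothetical and not part of ff or ffbase. Note that apply(..., 1, sort) returns one column per input row, which is why the question's code wraps it in t().

```r
# Hypothetical helper: sort the values within each row, one block of
# rows at a time, and stack the blocks back together.
# Assumes df has at least one row and at least two columns.
row_sort_chunked <- function(df, chunk_size = 2L) {
  starts <- seq(1L, nrow(df), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1L, nrow(df))
    # apply() returns one column per input row, so transpose back.
    t(apply(df[rows, , drop = FALSE], 1, sort))
  })
  do.call(rbind, pieces)
}

# Key columns from the question's sample data.
keys <- data.frame(col1 = c(1, 2, 3, 7), col2 = c(3, 4, 1, 5))
sorted_keys <- row_sort_chunked(keys)

# TRUE for rows to keep: the first occurrence of each sorted key pair.
keep <- !duplicated(sorted_keys)
```

In an ff setting, keep would have to be accumulated as an ff logical vector rather than held in RAM, which is the part the indexing support mentioned above is needed for.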
