简体   繁体   中英

Compare item in one row against all other rows and loop through all rows using data.table - R

I'm combining similar names using stringdist() , and have it working using lapply , but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far built from readings here , here , here , here , and here , but I'm not quite pulling it off:

library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))

I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified value (.08).

output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])

> output[1:3]
[[1]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[2]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[3]]
            mpg cyl disp hp drat   wt  qsec vs am gear carb    cartype id
Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1 Datsun 710  3

Data Table Attempt:

mtcarsdt <- as.data.table(mtcars)    
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]

An intermediate step: This code pulls similar names based on the row's value that I manually plug into myfun() , but it repeats that value for all the rows.

res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line

I'm now trying to cycle through all the rows using set() . I'm close, but although the code appears to be correctly matching the text from the 12th column ( cartype ) it's returning the values from the first column, mpg :

for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb                   cartype id
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4                 c(21, 21)  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4                 c(21, 21)  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1                      22.8  3

Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column it pretty much works, but there must be a cleaner way to do this. Also, it would be nice to keep the output in a list form like the lapply() output above as I have other post-processing steps set up for that format.

mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))

 > mtcarsdt[1:14,cartype]
 [1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [3] "Datsun 710"                                                 
 [4] "Hornet 4 Drive"                                             
 [5] "Hornet Sportabout"                                          
 [6] "Valiant"                                                    
 [7] "Duster 360"                                                 
 [8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"               
 [9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"               
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         

Have you tried using the matrix version of stringdist ?

res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)

out = as.data.table(which(res < 0.08, arr.ind = T))[, .(list(mtcars[row,])), by = col]$V1

identical(out, output)
#[1] TRUE

Now, you probably can't just run the above for a 500k X 500k matrix, but you can split it into smaller pieces (pick size appropriate for your data/memory sizes):

size = 4 # dividing into pieces of size 4x4
         # I picked a divisible number, a little more work will be needed
         # if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)

grid = CJ(seq_len(nrow(mtcars)/4), seq_len(nrow(mtcars)/4))

indices = grid[, {
            res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
                                   mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
                                   method = 'jw', p = 0.08)
            out = as.data.table(which(res < 0.08, arr.ind = T))
            if (nrow(out) > 0)
              out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
          }, by = .(V1, V2)]

identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM