Compare item in one row against all other rows and loop through all rows using data.table - R
I'm combining similar names using stringdist() and have it working with lapply(), but it takes 11 hours to run through 500k rows, and I'd like to see whether a data.table solution would be faster. Here's an example and my attempted solution so far, built from several related posts, but I'm not quite pulling it off:
library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))
I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified Jaro-Winkler distance (.08):
output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])
> output[1:3]
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag 2
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Datsun 710 3
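To make the threshold above concrete, here is a minimal sketch of the distance for a single pair of names. It assumes only the stringdist package already loaded in the question; the variable names a, b, d_jaro and d_jw are illustrative and not part of the original code.

```r
library(stringdist)

a <- "Mazda RX4"
b <- "Mazda RX4 Wag"

# Plain Jaro distance: method = "jw" with the default p = 0
d_jaro <- stringdist(a, b, method = "jw")

# Jaro-Winkler distance: p = 0.08 rewards the shared prefix,
# which shrinks the distance for names that start the same way
d_jw <- stringdist(a, b, method = "jw", p = 0.08)

d_jw < d_jaro   # the prefix bonus lowers the distance
d_jw < 0.08     # TRUE, which is why the two Mazda rows are grouped
```

The prefix weight p matters here: with the same .08 cutoff, the grouping depends on how much credit the shared "Mazda RX4" prefix receives.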
data.table attempt:
mtcarsdt <- as.data.table(mtcars)
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]
An intermediate step: this code pulls similar names based on the row index that I manually plug into myfun(), but it repeats that same result for every row:
res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line
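One possible way to make this intermediate step vary by row is to pass the grouping value to the function instead of a constant. The following is only a sketch under that assumption: cars and similar_rows() are hypothetical names introduced for illustration, and the approach still makes one stringdist() call per row, so it is not expected to be faster than lapply().

```r
library(stringdist)
library(data.table)

# Self-contained copy of the example data; row names become the cartype column
cars <- as.data.table(mtcars, keep.rownames = "cartype")
cars[, id := .I]

# Hypothetical helper: all rows whose name is within .08 of row i's name
similar_rows <- function(i) {
  d <- stringdist(cars$cartype[i], cars$cartype, method = "jw", p = 0.08)
  cars[d < 0.08]
}

# Inside j, `id` is the value of the current group, so each row gets its own result
res <- cars[, .(vlist = list(similar_rows(id))), by = id]

res$vlist[[1]]$cartype   # the two Mazda names
res$vlist[[3]]$cartype   # only "Datsun 710"
```

The result keeps a list column, so res$vlist has the same list-of-tables shape as the lapply() output.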
I'm now trying to cycle through all the rows using set(). I'm close, but although the code appears to be correctly matching the text from the 12th column (cartype), it's returning the values from the first column, mpg:
for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
mpg cyl disp hp drat wt qsec vs am gear carb cartype id
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 c(21, 21) 1
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 c(21, 21) 2
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8 3
Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column, it pretty much works. There must be a cleaner way to do this, though. Also, it would be nice to keep the output in list form like the lapply() output above, as I have other post-processing steps set up for that format.
mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))
> mtcarsdt[1:14,cartype]
[1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"
[3] "Datsun 710"
[4] "Hornet 4 Drive"
[5] "Hornet Sportabout"
[6] "Valiant"
[7] "Duster 360"
[8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"
[9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"
Have you tried using the matrix version of stringdist, stringdistmatrix()?
res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)
out = as.data.table(which(res < 0.08, arr.ind = T))[, .(list(mtcars[row,])), by = col]$V1
identical(out, output)
#[1] TRUE
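To see what the matrix approach does on a small scale, here is a sketch with three of the names from the question. The variable names nms, m, hits and groups are illustrative assumptions; only stringdistmatrix(), which() and split() are used.

```r
library(stringdist)

nms <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710")

# All pairwise Jaro-Winkler distances in one call (3 x 3 matrix)
m <- stringdistmatrix(nms, nms, method = "jw", p = 0.08)

# Positions below the threshold, as a two-column matrix (row, col)
hits <- which(m < 0.08, arr.ind = TRUE)

# For each column index, collect the names of the matching rows
groups <- split(nms[hits[, "row"]], hits[, "col"])

groups[["1"]]   # "Mazda RX4" "Mazda RX4 Wag"
groups[["3"]]   # "Datsun 710"
```

Each list element plays the role of one element of the lapply() output, except that it holds only the names rather than the full rows.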
Now, you probably can't just run the above for a 500k x 500k matrix, but you can split it into smaller pieces (pick a size appropriate for your data and memory):
size = 4 # dividing into pieces of size 4x4
         # I picked a divisible number, a little more work will be needed
         # if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)
# one grid row per pair of blocks; unnamed CJ() columns are called V1 and V2
grid = CJ(seq_len(nrow(mtcars)/size), seq_len(nrow(mtcars)/size))
indices = grid[, {
    # distances between block V1 and block V2 only
    res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
                           mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
                           method = 'jw', p = 0.08)
    out = as.data.table(which(res < 0.08, arr.ind = TRUE))
    if (nrow(out) > 0)
        # shift block-local positions back to row numbers in the full table
        out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
}, by = .(V1, V2)]
identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE
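The reason for chunking is memory: a full 500k x 500k matrix of doubles holds 2.5e11 entries at 8 bytes each, roughly 2 TB. For the residue case mentioned in the comments above, one option is to build the blocks as explicit index vectors so that the last block can simply be shorter. This is a sketch under that assumption, not the answer's own code; names_vec, chunks, a and b are hypothetical names, and size = 5 is chosen on purpose because it does not divide 32.

```r
library(stringdist)
library(data.table)

names_vec <- rownames(mtcars)   # the 32 car names
n <- length(names_vec)
size <- 5                       # deliberately not a divisor of 32

# Index vectors 1:5, 6:10, ..., 31:32 (the last block is shorter)
starts <- seq(1, n, by = size)
chunks <- lapply(starts, function(s) s:min(s + size - 1, n))

# One grid row per pair of blocks
grid <- CJ(a = seq_along(chunks), b = seq_along(chunks))

indices <- grid[, {
  ra <- chunks[[a]]
  rb <- chunks[[b]]
  res <- stringdistmatrix(names_vec[ra], names_vec[rb], method = "jw", p = 0.08)
  hit <- which(res < 0.08, arr.ind = TRUE)
  # Map block-local positions back to positions in the full vector
  if (nrow(hit) > 0) list(row = ra[hit[, "row"]], col = rb[hit[, "col"]])
}, by = .(a, b)]

# Sanity check against the full matrix, which is feasible for 32 names
full <- which(stringdistmatrix(names_vec, names_vec, method = "jw", p = 0.08) < 0.08,
              arr.ind = TRUE)
setorder(indices, col, row)
all(indices$row == full[, "row"]) && all(indices$col == full[, "col"])
```

The comparison against the full matrix is only there to check the sketch on small data; on 500k rows only the chunked part would be run, with a much larger size.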