More efficient way to select elements from a huge data frame

I have a huge data frame:

library(gtools)
a<-permutations(2,20,v=c(0,1),repeats.allowed=TRUE)
a<-as.data.frame(a)

And I have 100 random 0/1 vectors of length 20, stored as the columns of 'b':

set.seed(123)

b<-replicate(100,sample(c(0,1),20, replace=T))

I would like to identify the row number in 'a' that corresponds to each column of 'b'.

Since 'a' is huge, this process takes quite some time.

Right now I am using the following method:

sapply(1:100, function(x)  which(colSums(t(a)==as.numeric(b[,x]))==20L))

Is there a more efficient way to do this?

Treat each row of 'a' (and each column of 'b') as a bit string and encode it as a single number, then use %in% for fast look-up:

library(gtools)
a <- permutations(2,20,v=c(0,1),repeats.allowed=TRUE)
a <- as.data.frame(a)

set.seed(123)
b <- replicate(100, sample(c(0, 1), 20, replace=TRUE))

a1 <- colSums(t(a) * 2^(0:19))
b1 <- colSums(b * 2^(0:19))

which produces

> head(which(a1 %in% b1))
[1]  1191  9434 14502 19812 30619 34313
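The encoding step can be tried on a toy case. The following is only a sketch with 3 bits instead of 20; the names a_small, b_small, a_code and b_code are made up for illustration, and expand.grid() stands in for gtools::permutations() (its row order is different, which does not affect the encoding).

```r
# Toy version of the bit-string encoding, using 3 bits instead of 20.
# expand.grid() builds all 0/1 rows without needing gtools.
a_small <- as.matrix(expand.grid(0:1, 0:1, 0:1))
weights <- 2^(0:2)                       # 1, 2, 4

# t(a_small) is 3 x 8, so 'weights' recycles down each column,
# and colSums() yields one integer code per row of a_small.
a_code <- colSums(t(a_small) * weights)

# Two query vectors stored as columns, like 'b' in the question.
b_small <- cbind(c(1, 0, 1), c(0, 1, 1))
b_code  <- colSums(b_small * weights)    # 5 and 6

# Each row has a distinct code, so match() gives the row for each column.
rows <- match(b_code, a_code)
a_small[rows, ]                          # rows equal to the columns of b_small
```

Note that %in% only reports which rows of 'a' occur somewhere in 'b', while match(b_code, a_code) returns one row number per column of 'b'.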

To deal with duplicates (several columns of 'b' that encode to the same value), consider this smaller example:

b1 <- c(1, 3, 3, 5, 4)
a1 <- c(3, 4, 8)

Find the unique b1 values, and create a list that maps each unique value to its indices in the original b1:

ub1 <- unique(b1)
umap <- unname(split(seq_along(b1), match(b1, ub1)))

Now match a1 against the unique b1 values, decide which elements to keep (those that are not NA), and look up the matches in the unique map:

m <- match(a1, ub1)
keep <- which(!is.na(m))
keepmap <- umap[m[keep]]

Finally, use keepmap to work out how many times each kept value needs to be replicated (because it can map to several original values), and create a data.frame of the results:

len <- lengths(keepmap)
## ai: index into a1; a1: the matched value; bi: index into b1
data.frame(ai=rep(keep, len),
           a1=rep(a1[keep], len),
           bi=unlist(unname(keepmap)))
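To make the intermediate values concrete, here is the same small example again as a sketch, with the value of each step noted in a comment and a final sanity check. The variable res and the check are additions for illustration only.

```r
b1 <- c(1, 3, 3, 5, 4)
a1 <- c(3, 4, 8)

ub1  <- unique(b1)                                    # 1 3 5 4
umap <- unname(split(seq_along(b1), match(b1, ub1)))  # list(1, 2:3, 4, 5)

m       <- match(a1, ub1)                             # 2 4 NA
keep    <- which(!is.na(m))                           # 1 2
keepmap <- umap[m[keep]]                              # list(2:3, 5)

# One row per (index into a1, index into b1) pair.
res <- data.frame(ai = rep(keep, lengths(keepmap)),
                  bi = unlist(keepmap))

# Every reported pair really is a match between a1 and b1.
stopifnot(all(a1[res$ai] == b1[res$bi]))
```

Here a1[1] (the value 3) pairs with positions 2 and 3 of b1, and a1[2] (the value 4) pairs with position 5; the value 8 has no match and is dropped.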

So a complete function is

matchrows <-
    function(a, b)
{
    ## encode
    a1 <- colSums(t(a) * 2^(0:19))
    b1 <- colSums(b * 2^(0:19))

    ## match to unique values
    ub1 <- unique(b1)
    m <- match(a1, ub1)
    keep <- which(!is.na(m))

    ## expand unique matches to original coordinates
    umap <- unname(split(seq_along(b1), match(b1, ub1)))
    keepmap <- umap[m[keep]]

    len <- lengths(keepmap)
    data.frame(ai=rep(keep, len),
               bi=unlist(unname(keepmap)),
               value=rep(a1[keep], len))
}
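If only one row number per column of 'b' is wanted, as in the question, calling match() on the encoded values may be enough. The helper below is a hypothetical sketch, not part of the answer above: rows_for_columns is an invented name, and it takes the number of bits from ncol(a) rather than hard-coding 20. It assumes the entries are 0/1 and that there are at most 53 columns, so the codes stay exactly representable as doubles.

```r
# Hypothetical helper (not from the original answer): for each column of 'b',
# return the first matching row of 'a', or NA if no row matches.
rows_for_columns <- function(a, b) {
    a <- as.matrix(a)
    b <- as.matrix(b)
    stopifnot(ncol(a) == nrow(b), ncol(a) <= 53)
    weights <- 2^(seq_len(ncol(a)) - 1)  # 2^(0:19) when there are 20 columns
    a1 <- colSums(t(a) * weights)        # one code per row of 'a'
    b1 <- colSums(b * weights)           # one code per column of 'b'
    match(b1, a1)
}

# Small usage example: all 4-bit rows, with a repeated query column.
a_demo <- as.matrix(expand.grid(0:1, 0:1, 0:1, 0:1))
b_demo <- cbind(c(1, 1, 0, 0), c(0, 0, 0, 1), c(1, 1, 0, 0))
idx <- rows_for_columns(a_demo, b_demo)

# Each selected row of 'a' reproduces the corresponding column of 'b'.
all(t(a_demo[idx, ]) == b_demo)
```

Because match() returns the first hit, this is adequate when the rows of 'a' are unique, as they are for the full set of permutations. Repeated columns of 'b' simply yield the same row number more than once.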

