
R: Suggestion to speed up a function (remove duplicates in data frame)

I've run into a bit of trouble with my code and would welcome any suggestions to make it run faster. I have a data frame that looks like this:

Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
d <- data.frame(Name,Category,More_info)

So I have duplicated entries for each row in column Name (the number of duplicates can vary). For each entry (a, b, ...) I want to count the occurrences of each corresponding element in the Category column and keep only the category that appears most often. If an entry has a tie among its most frequent categories, I want to pick one of them at random. So in this case, the output data frame would look like this:

Name <- c("a","b","c")
Category <- c("sun","dog","cat")
More_info <- c("table","table","table")
d <- data.frame(Name,Category,More_info)

a keeps the sun entry because it appears the most, b gets dog or any of the other values, since each appears exactly once with b, and c is unchanged. My function looks like this:

    my_choosing_function <- function(x){
      # subset the rows belonging to the current element
      tmp <- dbSNP_hapmap[dbSNP_hapmap$refsnp_id == list_of_snps[x],]
      # tabulate how often each category occurs for this element
      snp_freq <- as.data.frame(table(tmp$consequence_type_tv))
      # order by descending frequency and keep the top row
      best_hit <- snp_freq[order(-snp_freq$Freq),]
      best_hit$SNP <- list_of_snps[x]
      top <- best_hit[1,]
      return(top)
    }
    trst <- lapply(seq_along(list_of_snps), my_choosing_function)
    final <- do.call("rbind", trst)

I start from a list of unique elements (Name in our case); for each element I build a frequency table of its duplicated entries, order the table by descending counts, and keep the top element. I lapply over each element of the list of unique values, then rbind the whole thing together.

As I have 2,500,000 rows in my initial data frame and 1,500,000 unique elements, it takes forever to run: at 4 seconds per 100 elements, the lapply alone would take roughly 17 hours.
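(Back-of-the-envelope, assuming the measured rate of 4 s per 100 elements holds across all 1.5 million unique elements:)

secs_per_100 <- 4                                # measured on a 100-element slice
n_unique     <- 1.5e6                            # unique Names to process
total_secs   <- n_unique / 100 * secs_per_100    # 60,000 s
total_secs / 3600                                # ~16.7 hours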

I'm sure packages like dplyr can do this in a few minutes, but I can't find a solution. Does anyone have an idea? Thanks a lot for your help!

Note: this should really be a very long comment, because I use data.table instead of dplyr.

I suggest using data.table because it runs faster. In the data.table approach shown below, ties are broken by choosing one category at random, not by always taking the first one.

library(data.table)
library(dplyr)
library(microbenchmark)

d <- data.frame(
    Name = as.character(sample.int(10000, 2.5e6, replace = T)),
    Category = as.character(sample.int(10000, 2.5e6, replace = T)),
    More_info = rep('table', 2.5e6)
)

Mode <- function(x) {
    ux <- unique(x)
    fr1 <- tabulate(match(x, ux))  # frequency of each unique value
    # all frequencies equal: pick a value at random;
    # otherwise take the first value with the maximum frequency
    if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

system.time({
    d %>%
        group_by(Name) %>%
        slice(which(Category == Mode(Category))[1])
})
#    user  system elapsed
#  45.932   0.808  46.745

system.time({
    dt <- as.data.table(d)
    dt.max <- dt[, .N, by = .(Name, Category)]
    dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
    dt.max <- dt.max[r == 1, .(Name, Category)]

    dt[dt.max, on = .(Name, Category), mult = 'first']
})
#    user  system elapsed
#   2.424   0.004   2.426
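As a quick sanity check, here is a minimal sketch (pick_top is just an illustrative wrapper around the same data.table steps) run twice on the toy data from the question; row b is a four-way tie, so its Category can change between runs, while a and c stay fixed:

library(data.table)

Name <- c("a","a","a","a","a","b","b","b","b","c")
Category <- c("sun","cat","sun","sun","sea","sun","sea","cat","dog","cat")
More_info <- c("table","table","table","table","table","table","table","table","table","cat")
dt <- data.table(Name, Category, More_info)

pick_top <- function(dt) {
    counts <- dt[, .N, by = .(Name, Category)]
    counts[, r := frank(-N, ties.method = 'random'), by = .(Name)]
    dt[counts[r == 1, .(Name, Category)], on = .(Name, Category), mult = 'first']
}

pick_top(dt)  # "a" -> sun, "c" -> cat, "b" -> one of sun/sea/cat/dog
pick_top(dt)  # "b" may come back with a different Category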

We can modify the Mode function from here and then do a group_by and slice:

library(dplyr)

Mode <- function(x) {
 ux <- unique(x)
 fr1 <- tabulate(match(x, ux))
 if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

d %>% 
  group_by(Name) %>%
  slice(which(Category == Mode(Category))[1])
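A caveat with this Mode(): it only randomises when every category is tied. In a partial tie, which.max() deterministically returns the first of the tied maxima, for example:

# ux = c("sun","sea","cat") and fr1 = c(2, 2, 1): the frequencies are
# not all equal, so which.max() is used and "sun" always wins over "sea"
Mode(c("sun","sun","sea","sea","cat"))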

A couple of slight tweaks to @mt1022's solution can produce a marginal speedup: nothing to phone home about, but possibly of use if your data grows by another order of magnitude.

library(data.table)
library(dplyr)

d <- data.frame(
 Name = as.character(sample.int(10000, 2.5e6, replace = T)),
 Category = as.character(sample.int(5000, 2.5e6, replace = T)),
 More_info = rep('table', 2.5e6)
)

Mode <- function(x) {
 ux <- unique(x)
 fr1 <- tabulate(match(x, ux))
 if(n_distinct(fr1)==1) ux[sample(seq_along(fr1), 1)] else ux[which.max(fr1)]
}

system.time({
 d %>%
   group_by(Name) %>%
   slice(which(Category == Mode(Category))[1])
})

# user   system elapsed 
# 40.459   0.180  40.743 

system.time({
 dt <- as.data.table(d)
 dt.max <- dt[, .N, by = .(Name, Category)]
 dt.max[, r := frank(-N, ties.method = 'random'), by = .(Name)]
 dt.max <- dt.max[r == 1, .(Name, Category)]

 dt[dt.max, on = .(Name, Category), mult = 'first']
})

# user  system elapsed 
# 4.196   0.052   4.267 

Tweaks include:

  • Use setDT() instead of as.data.table() to avoid making a copy
  • Use stats::runif() to generate the random tiebreaker directly; this is what data.table does internally for the 'random' option of frank()
  • Use setkey() to sort the table
  • Subset the table by the row indices, .I, where the row within each group equals the number of observations, .N, in that group (this returns the last row of each group)

Results:

system.time({
 dt.max <- setDT(d)[, .(Count = .N), keyby = .(Name, Category)]
 dt.max[,rand := stats::runif(.N)]
 setkey(dt.max,Name,Count, rand)
 dt.max[dt.max[,.I[.N],by = .(Name,Category)]$V1,.(Name,Category,Count)]
})

# user  system elapsed 
# 1.722   0.057   1.750 
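To make the last tweak concrete, here is a minimal illustration of the .I[.N] idiom on an assumed toy table (toy is just an illustrative name):

library(data.table)

toy <- data.table(g = c("a","a","b"), v = 1:3)
# .I[.N] yields, per group, the global row index of the group's last row
idx <- toy[, .I[.N], by = g]$V1  # c(2, 3)
toy[idx]                         # the last row of each group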
