
Maximizing speed of a loop/apply function

I am struggling quite a bit with a huge data set at the moment. What I would like to do is not very complicated, but it is just too slow. In the first step, I need to check whether a website is active or not. For this, I used the following code (here with a sample of three API paths):

library(httr)

Updated <- function(x) { http_error(GET(x)) }
websites <- data.frame(c("https://api.crunchbase.com/v3.1/organizations/designpitara",
                         "www.twitter.com",
                         "www.sportschau.de"))
abc <- apply(websites, 1, Updated)

I already noticed that a for loop is considerably faster than the apply function. However, the full code (which has around 1 million APIs to check) would still take around 55 hours to execute. Any help is appreciated :)

The primary limiting factor will probably be the time taken to query the website. Currently, you're waiting for each query to return a result before executing the next one. The best way to speed up the workflow would be to execute batches of queries in parallel.
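
One caveat before parallelizing: http_error(GET(x)) throws an error outright if the request itself fails (e.g. a DNS failure or timeout) rather than returning TRUE/FALSE, and a single such error can disrupt a whole parallel batch. A hedged sketch of a more defensive wrapper (UpdatedSafe and the 10-second timeout are illustrative choices, not part of the original code):

### A more defensive check: count failed requests as errors ###
UpdatedSafe <- function(x) {
  tryCatch(http_error(GET(x, timeout(10))),
           error = function(e) TRUE)  # unreachable/failed -> treat as error
}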

If you're using a Unix system you could try the following:

### Packages ###
library(parallel)

### On your example ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 3))

### On a larger number of sites ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = detectCores()))

### You can even go beyond your machine's core count ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 40))

However, the precise number of threads at which you saturate your processor/internet connection is kind of dependent upon your machine and your connection.
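
One rough way to find that saturation point on your own machine (a sketch; the sample size and core counts below are arbitrary): time a fixed sample of URLs at increasing core counts and stop scaling up once the elapsed time stops improving.

### Time a small sample at increasing core counts ###
sample_sites <- head(websites[[1]], 100)
for (n in c(2, 4, 8, 16)) {
  t <- system.time(mclapply(sample_sites, Updated, mc.cores = n))
  cat(n, "cores:", t[["elapsed"]], "seconds\n")
}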

Alternatively, if you're stuck on Windows:

### For a larger number of sites ###
cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, varlist = "websites")  # make the data available to each worker
clusterEvalQ(cl = cl, library(httr))     # load httr on each worker
abc <- parSapply(cl = cl, X = websites[[1]], FUN = Updated, USE.NAMES = FALSE)
stopCluster(cl)

In the case of PSOCK clusters, I'm not sure whether there are any benefits to be had from exceeding your machine's core count, although I'm not a Windows person, and I welcome any correction.

Alternatively, something like this would work for passing multiple libraries to the PSOCK cluster:

clusterEvalQ(cl, {
     library(data.table)
     library(survival)
})
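
Finally, since the full run covers around a million URLs, it may be worth processing them in chunks and saving each chunk's result to disk, so a crash partway through doesn't cost hours of work. A minimal sketch, assuming the PSOCK cluster cl from above is still running (the chunk size and file name pattern are arbitrary):

### Process in chunks and checkpoint each result to disk ###
urls <- websites[[1]]
chunks <- split(urls, ceiling(seq_along(urls) / 10000))
for (i in seq_along(chunks)) {
  res <- parSapply(cl, chunks[[i]], Updated, USE.NAMES = FALSE)
  saveRDS(res, sprintf("results_chunk_%03d.rds", i))
}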
