
R Markdown file slow to knit due to large dataset

I am new to R Markdown. Apologies if the question has an obvious answer that I missed.

Context:

I am working with a large dataset (roughly 90 million rows) in R Markdown to produce a short report. While working on the file formatting, I want to knit the final HTML document frequently (e.g., after making a change) to look at the formatting.

Problem:

The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I am able to run the individual chunks since the data are loaded into the global environment, but formatting is incredibly onerous, since it is difficult to visualize the result of formatting changes without looking at the knitted product.

Attempts to solve the issue:

After some research, I found and tried using `cache = TRUE` and `cache.extra = file.mtime('my-precious.csv')` (as per this section of Yihui's Bookdown). However, this option didn't work, as it resulted in the following error:

```
Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress,  : 
  long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable
```

To overcome this error, I added `cache.lazy = FALSE` to the chunk options (as mentioned here). Unfortunately, while the code worked, the time it took to knit the document did not go down.
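For reference, these options combine in the chunk header like this (a minimal sketch; the chunk label and the `read.csv()` call are illustrative, since the loading code isn't shown above):

```{r load-data, cache=TRUE, cache.extra=file.mtime('my-precious.csv'), cache.lazy=FALSE}
# Re-execute (and re-cache) only when the CSV's modification time changes;
# cache.lazy=FALSE stores the cache eagerly, avoiding the long-vector
# limit of lazy-loading that triggered the error above.
my_data <- read.csv('my-precious.csv')
```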

My limited understanding of this process is that setting `cache = TRUE` and `cache.extra = file.mtime('my-precious.csv')` causes a code chunk's executed results to be cached, so that the next time the file is knit, the results from the previous run are loaded. However, because my file is too large, `cache = TRUE` alone doesn't work, so I have to use `cache.lazy = FALSE` to reverse what `cache = TRUE` does. In the end, this means the dataset is loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.

Questions to which I seek answers from the R community:

  1. Is there a way to cache the data-loading chunk in R Markdown when the file size is large (~90 million rows)?
  2. Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
  3. Is my understanding of the `cache = TRUE` method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the `cache = TRUE` method work for me?

Any help is appreciated.

> Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?

Yes. Perform your computations outside of the R Markdown report.

Plots can be saved to files and included in the report via `knitr::include_graphics(myfile)`. Tables can be saved to smaller summary files, loaded via `fread`, and displayed via `kable`.
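A minimal sketch of that split, assuming `data.table` and `ggplot2` are available; the file names, column names (`group`, `value`), and summary computation are illustrative:

```r
## precompute.R -- run once, outside the report
library(data.table)
library(ggplot2)

dt <- fread("my-precious.csv")                          # full 90M-row load, paid once
smry <- dt[, .(mean_value = mean(value)), by = group]   # small summary table
fwrite(smry, "summary.csv")                             # save the lightweight result

p <- ggplot(smry, aes(x = group, y = mean_value)) + geom_col()
ggsave("summary-plot.png", p, width = 6, height = 4)
```

The report itself then only touches the small files:

```{r show-results, echo=FALSE}
smry <- data.table::fread("summary.csv")
knitr::kable(smry)
knitr::include_graphics("summary-plot.png")
```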

Note that if you need to print tables in a loop, you should specify the `results='asis'` chunk option:

```{r my_chunk_label, results='asis', echo=FALSE}
library(knitr)

# kable() output must be print()ed explicitly inside a loop, and
# results='asis' tells knitr to pass the raw markup through untouched.
for (i in seq_along(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```

Run your expensive computations once and save the results. Consume those results with a light R Markdown report that is easy to format.
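The same pattern works for arbitrary R objects (fitted models, lists) via `saveRDS()`/`readRDS()`; the object and file names here are illustrative:

```r
# In the precompute script: save any expensive result as an .rds file.
fit <- lm(value ~ group, data = dt)   # hypothetical slow computation
saveRDS(fit, "fit.rds")

# In the report: reload it in a fraction of the original run time.
fit <- readRDS("fit.rds")
summary(fit)
```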

If you still have large CSV files to load, use `data.table::fread`, which is much more efficient than the base functions.
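For example (a rough sketch; the `select` columns are illustrative):

```r
library(data.table)

# fread() is multi-threaded and parses CSVs far faster than read.csv();
# reading only the columns you need cuts time and memory further.
dt <- fread("my-precious.csv", select = c("id", "value"))
```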

I actually posted a similar question not so long ago. You're not alone.
