R Markdown file slow to knit due to large dataset
I am new to R Markdown. Apologies if the question has an obvious answer that I missed.
Context:
I am working with a large dataset (roughly 90 million rows) in R Markdown to produce a short report. While working on the file formatting, I want to knit the final HTML document frequently (e.g., after making a change) to look at the formatting.
Problem:
The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I am able to run the individual chunks since the data are loaded into the global environment, but formatting is incredibly onerous, since it is difficult to visualize the result of formatting changes without looking at the knitted product.
Attempts to solve the issue:
After some research, I found and tried to use `cache = TRUE` and `cache.extra = file.mtime('my-precious.csv')` (as per this section of Yihui's Bookdown). However, this option didn't work, as it resulted in the following:
```
Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress, :
  long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable
```
To overcome this error, I added `cache.lazy = FALSE` to the chunk options (as mentioned here). Unfortunately, while the code worked, the time it took to knit the document did not go down.
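For reference, the combination of options described above can be written as a single chunk header. This is only a sketch; the chunk label and the code that reads the file are placeholder assumptions:

```{r load-data, cache=TRUE, cache.lazy=FALSE, cache.extra=file.mtime('my-precious.csv')}
# Re-run (and re-cache) only when the file's modification time changes
my_data <- data.table::fread('my-precious.csv')
```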
My limited understanding of this process is that having `cache = TRUE` and `cache.extra = file.mtime('my-precious.csv')` will lead to a code chunk's executed results being cached, so that the next time the file is knit, the results from the previous run are loaded. However, because my file is too large, `cache = TRUE` doesn't work on its own, so I have to use `cache.lazy = FALSE` to reverse part of what `cache = TRUE` does. In the end, this means that the dataset is loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.
Questions to which I seek answers from the R community:

1. Is my understanding of the `cache = TRUE` method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the `cache = TRUE` method work for me?
2. Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?

Any help is appreciated.
Yes. Perform your computations outside of the R Markdown report.
Plots can be saved into files and included into the report via `knitr::include_graphics(myfile)`. Tables can be saved into smaller summary files, loaded via `fread`, and displayed via `kable`.
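As a sketch of that workflow (the column names `group` and `value` and the output file names are hypothetical; in practice the large file would be read with `fread` rather than generated in code):

```r
library(data.table)

# Toy stand-in for the large dataset; in practice: dt <- fread("my-precious.csv")
dt <- data.table(group = rep(c("a", "b"), each = 5), value = 1:10)

# Aggregate down to a small summary table the report can load instantly
summary_dt <- dt[, .(mean_value = mean(value)), by = group]
fwrite(summary_dt, "summary.csv")

# Save the plot as an image file for knitr::include_graphics("my_plot.png")
png("my_plot.png", width = 800, height = 600)
barplot(summary_dt$mean_value, names.arg = summary_dt$group)
dev.off()
```

Inside the report, `fread("summary.csv")` and `kable()` then operate on a two-row table instead of 90 million rows.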
Note that if you need to print tables in a loop, you should specify the `results='asis'` chunk option.
```{r my_chunk_label, results='asis', echo=FALSE}
# 'asis' lets the raw markdown produced by kable() render as actual tables
for (i in seq_along(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```
Run your expensive computations once and save the results. Consume these results with a light R Markdown report that is easy to format.
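A minimal sketch of that split, with a trivial placeholder standing in for the expensive computation and `results.rds` as an assumed file name:

```r
# precompute.R — run once, and re-run only when the data change
results <- list(total = sum(1:100))   # stands in for the expensive computation
saveRDS(results, "results.rds")

# Inside the lightweight .Rmd, just reload the saved object:
results <- readRDS("results.rds")
results$total   # 5050
```

Knitting then costs only the `readRDS()` call, regardless of how long the original computation took.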
If you still have large csv files to load, you should use `data.table::fread`, which is much more efficient than the base functions.
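For example (writing a small temporary file here only to keep the snippet self-contained):

```r
library(data.table)

tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:3, y = c("a", "b", "c")), tmp)

dt <- fread(tmp)   # autodetects separator and column types, reads in parallel
# vs. the much slower base equivalent on large files:
# df <- read.csv(tmp)
```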
I actually posted a similar question not so long ago. You're not alone.