
R Markdown file slow to knit due to large dataset

I am new to R Markdown. Apologies if the question has an obvious answer that I missed.

Context:

I am working with a large dataset in R Markdown (roughly 90 million rows) to produce a short report. While working on the file's formatting, I want to knit the final HTML document frequently (e.g., after each change) to check the formatting.

Problem:

The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I can run the individual chunks interactively since the data are loaded into the global environment, but formatting is incredibly onerous because it is difficult to visualize the result of formatting changes without looking at the knitted product.

Attempts to solve the issue:

After some research, I found and tried to use cache = TRUE and cache.extra = file.mtime('my-precious.csv') (as per this section of Yihui's Bookdown). However, this option didn't work, as it resulted in the following error:

Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress,  : 
  long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable

To overcome this error, I added cache.lazy = FALSE to the chunk options (as mentioned here). Unfortunately, while the code then worked, the time it took to knit the document did not go down.
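For reference, a minimal sketch of how these chunk options combine in a single chunk header (the object name my_data and the use of read.csv are just placeholders, not part of my actual code):

```{r load-data, cache=TRUE, cache.lazy=FALSE, cache.extra=file.mtime('my-precious.csv')}
# Read the full dataset; the cache is invalidated whenever the CSV's
# modification time changes, and cache.lazy = FALSE avoids the
# "long vectors not supported" error for very large objects.
my_data <- read.csv('my-precious.csv')
```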

My limited understanding of this process is that setting cache = TRUE and cache.extra = file.mtime('my-precious.csv') causes a code chunk's results to be cached, so that the next time the file is knit, the results from the previous run are loaded. However, because my file is too large, cache = TRUE doesn't work on its own, so I have to use cache.lazy = FALSE, which seems to reverse what cache = TRUE does. In the end, this means the dataset is loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.

Questions to which I seek answers from the R community:

  1. Is there a way to cache the data-loading chunk in R Markdown when the file size is large (~90 million rows)?
  2. Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
  3. Is my understanding of the cache = TRUE method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the cache = TRUE method work for me?

Any help is appreciated.

> Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?

Yes. Perform your computations outside of the Rmarkdown report.

Plots can be saved to files and included in the report via knitr::include_graphics(myfile). Tables can be saved into smaller summary files, loaded via fread, and displayed via kable.
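A minimal sketch of what such a display chunk might look like (the file names figures/plot1.png and results/summary.csv are hypothetical placeholders):

```{r show-results, echo=FALSE}
library(knitr)
library(data.table)

# Include a figure produced and saved by the heavy pre-processing step
include_graphics("figures/plot1.png")

# Load a small pre-computed summary table and print it
summary_dt <- fread("results/summary.csv")
kable(summary_dt)
```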

Note that if you need to print tables in a loop, you should specify the results='asis' chunk option, as in the chunk below.

```{r my_chunk_label, results='asis', echo=FALSE}
# Print one table per list element; cat('\n') separates the tables in the raw output
for (i in seq_along(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```

Run your expensive computations once, save the results. Consume these results with a light Rmarkdown report that is easy to format.
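As a rough sketch of that split (the script name prepare_data.R, the columns value and group, and the output file summary.rds are all hypothetical):

```r
## prepare_data.R -- run once, outside of the report
big <- read.csv("my-precious.csv")                 # the slow, 90-million-row load

# Whatever aggregation the report actually needs (columns here are made up)
summary_df <- aggregate(value ~ group, data = big, FUN = mean)

saveRDS(summary_df, "summary.rds")                 # small file, fast to re-load
```

Then the report itself only touches the small file:

```{r load-summary, echo=FALSE}
summary_df <- readRDS("summary.rds")   # loads in a fraction of a second
knitr::kable(summary_df)
```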

If you still have large csv files to load, you should use data.table::fread which is much more efficient than base functions.
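A minimal example of that swap (same hypothetical file name as above):

```r
library(data.table)

# fread is typically much faster than read.csv on large files
big <- fread("my-precious.csv")
```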

I actually posted a similar question not so long ago. You're not alone.
