R: Reading first n rows from parquet file?
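Thanks to Jon and Dan for pointing me in the right direction.
arrow::open_dataset() allows lazy evaluation (docs [here][1]). You can then call head() (but not slice()) or filter() on the result. This is faster and uses less peak memory than reading the whole file. Example below.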
# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file
library(dplyr)
library(arrow)
library(tictoc) #optional, used to time results
tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of ram
tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet")
my_animals # this is a lazy object
my_animals %>%
  # slice(1000L) %>% # doesn't work
  head(n = 1000L) %>%
  # filter(YEAROFBIRTH >= 2010) %>% # also works
  compute() %>%
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak ram used
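
If the goal is to work with the subset in memory rather than write it back to disk, the same lazy query can end in collect() instead. The following is only a minimal sketch under stated assumptions: the temporary file and the ID column are made up so the snippet runs on its own, and YEAROFBIRTH is borrowed from the commented-out filter above.

```r
library(dplyr)
library(arrow)

# Hypothetical stand-in data, written to a temporary file so the sketch is self-contained
path <- tempfile(fileext = ".parquet")
animals <- data.frame(
  ID = 1:5000,                                      # assumed column, for illustration only
  YEAROFBIRTH = rep(2001:2020, length.out = 5000)
)
write_parquet(animals, path)

# open_dataset() returns a lazy Dataset; no rows are loaded at this point
ds <- open_dataset(path)

# First n rows, pulled into an in-memory tibble
first_rows <- ds %>%
  head(n = 1000L) %>%
  collect()

# Filter and select are part of the lazy query; collect() returns only the result
recent <- ds %>%
  filter(YEAROFBIRTH >= 2010) %>%
  select(ID, YEAROFBIRTH) %>%
  collect()

nrow(first_rows)
nrow(recent)
```

collect() returns a regular tibble, so everything after it is ordinary dplyr. Keeping filter() and select() before collect() is what keeps peak memory low, since only the matching rows and columns are materialized.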
[1]: https://arrow.apache.org/docs/r/articles/dataset.html