
Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI, but I recently came across arrow and would like to explore it more.

The problem is that the flat files I typically work with are large enough that they cannot be read into R without help. So I would ideally like a way to make the conversion without actually needing to read the dataset into R in the first place.

Any help you can provide would be much appreciated!

arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
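A minimal sketch of that workflow. The directory names and the `year` partition column are hypothetical, and a tiny CSV is written to a temporary directory only so the example runs on its own:

```r
library(arrow)

# Hypothetical input: a small CSV in a temporary directory standing in for
# a directory of large flat files.
csv_dir <- file.path(tempdir(), "csv_dir")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(
  data.frame(year = c(2020L, 2020L, 2021L), value = c(1.5, 2.5, 3.5)),
  file.path(csv_dir, "part1.csv"),
  row.names = FALSE
)

# Open the directory as a Dataset; the rows are not loaded into R here.
ds <- open_dataset(csv_dir, format = "csv")

# Rewrite as Parquet, with one subdirectory per value of `year`.
parquet_dir <- file.path(tempdir(), "parquet_dir")
write_dataset(ds, parquet_dir, format = "parquet", partitioning = "year")

# Inspect the files that were written.
list.files(parquet_dir, recursive = TRUE)
```

Partitioning by a column you often filter on lets later queries skip whole subdirectories.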

One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
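The symlink workaround might look roughly like this. All paths are hypothetical, and the copy fallback is an assumption added for systems where symlinks cannot be created:

```r
library(arrow)

# Hypothetical layout: `mixed_dir` holds the data file next to non-data files.
mixed_dir <- file.path(tempdir(), "mixed_dir")
dir.create(mixed_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3), file.path(mixed_dir, "obs.csv"), row.names = FALSE)
writeLines("notes", file.path(mixed_dir, "README.txt"))

# Create a clean directory that contains only the data file.
clean_dir <- file.path(tempdir(), "only_data")
dir.create(clean_dir, showWarnings = FALSE)
target <- file.path(clean_dir, "obs.csv")
linked <- suppressWarnings(
  file.symlink(normalizePath(file.path(mixed_dir, "obs.csv")), target)
)

# Fall back to copying if the symlink could not be created.
if (!linked && !file.exists(target)) {
  file.copy(file.path(mixed_dir, "obs.csv"), target)
}

# Open the clean directory instead of the mixed one.
ds <- open_dataset(clean_dir, format = "csv")
```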

You can do it this way:

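library(arrow)

csv_file <- "obs.csv"
dest <- "obs_parquet/"

# Declare the column types up front so Arrow does not have to infer them.
sch <- schema(checklist_id = float32(),
              species_code = string())

# Open the CSV as a Dataset without loading it into R. With an explicit
# schema, column names come from the schema, so skip the header row.
csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

# Stream the rows out as Parquet files of at most one million rows each.
write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")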

In my case (a 56 GB CSV file), the resulting Parquet files contained unexpected rows that did not exist in the original CSV, so double-check your output for spurious rows. I filed a bug report about it:
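One inexpensive sanity check is to compare row counts between the source CSV and the Parquet output. This is only a sketch: the file names are hypothetical stand-ins, and matching counts do not prove the contents are identical:

```r
library(arrow)

# Hypothetical stand-in for the real CSV and its Parquet conversion.
csv_file <- file.path(tempdir(), "obs_check.csv")
write.csv(
  data.frame(checklist_id = c(1, 2, 3), species_code = c("a", "b", "c")),
  csv_file,
  row.names = FALSE
)
dest <- file.path(tempdir(), "obs_check_parquet")
write_dataset(open_dataset(csv_file, format = "csv"), dest, format = "parquet")

# Count rows on both sides without pulling the data into an R data frame.
n_csv <- nrow(open_dataset(csv_file, format = "csv"))
n_parquet <- nrow(open_dataset(dest))

if (n_csv != n_parquet) {
  warning("Row counts differ: CSV has ", n_csv, ", Parquet has ", n_parquet)
}
```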

https://issues.apache.org/jira/browse/ARROW-17432

If you run into the same issue, use the Python Arrow library to convert the CSV to Parquet and then load the result into R. The code is also in the Jira ticket.
