Read large csv file from S3 into R

I need to load a 3 GB csv file with about 18 million rows and 7 columns from S3 into R (RStudio). My code for reading data from S3 usually works like this:

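library("aws.s3")
# Download the object into memory as a raw vector
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
# Convert the raw bytes into a single character string
csvcharobj <- rawToChar(obj)
# Parse the string as csv through a text connection
con <- textConnection(csvcharobj)
data <- read.csv(file = con)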

Now, with the file being much bigger than usual, I receive an error:

> csvcharobj <- rawToChar(obj)  
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68

From reading this post, I understand that the vector is too long, but how would I subset the data in this case? Are there any other suggestions for reading larger files from S3?

You can use AWS Athena: point Athena at your S3 files and query only the records you need into R. How to run R with Athena is explained in detail in the post below.

https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/

Hope it helps.

If you are using Spark or a similar product, another workaround is to read/load the csv into a DataTable and then continue processing it with R Server / sparklyr.

Building on Hugh's comment on the original post, here is an answer for those wishing to load regular-sized csv files from S3.

At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.

Thus

data <- 
    aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")

will do the trick. However, if you want your work to run faster and cleaner, I prefer this:

library(magrittr)  # provides the %>% pipe

data <- 
    aws.s3::s3read_using(data.table::fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
    janitor::clean_names()

Previously the more verbose method below was required:

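library(aws.s3)
library(magrittr)  # provides the %>% pipe

# save_object() writes the file to the working directory and returns its path
data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv") %>%
  data.table::fread()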

It works for files up to at least 305 MB.
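For files well beyond that size, and for the subsetting asked about in the question, a csv that save_object() has already written to disk can be read in pieces instead of all at once. The following is only a minimal sketch in base R: read_csv_in_chunks() and its keep_rows argument are hypothetical names of mine, not part of aws.s3, and a small local file stands in for the downloaded csv.

```r
# Hypothetical helper (not part of aws.s3): read a local csv chunk by chunk
# and keep only the rows for which keep_rows(chunk) is TRUE.
read_csv_in_chunks <- function(path, chunk_size, keep_rows) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once; scan() also strips quotes around the names
  col_names <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = col_names, nrows = chunk_size),
      error = function(e) NULL  # read.csv() errors once no lines are left
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    pieces[[length(pieces) + 1]] <- chunk[keep_rows(chunk), , drop = FALSE]
    if (nrow(chunk) < chunk_size) break  # short chunk means end of file
  }
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Small local stand-in for a file written by save_object()
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), demo_path, row.names = FALSE)

subset_data <- read_csv_in_chunks(demo_path, chunk_size = 3,
                                  keep_rows = function(d) d$value > 10)
print(subset_data)
```

Note that the tryCatch() treats any read error as end of file, which is fine for a sketch but would hide a malformed line in real data.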

A better alternative to filling up your working directory with a copy of every csv you load:

data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv",
              file = tempfile(fileext = ".csv")) %>%
  data.table::fread()

If you are curious about where the tempfile is located, Sys.getenv() can give some insight: see TMPDIR, TMP, or TEMP. More information can be found in the base R tempfile docs.
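As a quick illustration of the point about temp file locations, this sketch uses only base R functions to show where the session's temporary directory is and which of the environment variables are set. The variable names tmp_dir and tmp_csv are just for illustration.

```r
# Directory R uses for temporary files in this session
tmp_dir <- tempdir()
print(tmp_dir)

# Environment variables R checks at startup to choose that directory
# (unset ones come back as empty strings)
print(Sys.getenv(c("TMPDIR", "TMP", "TEMP")))

# A path generated by tempfile() lives inside tempdir();
# the file itself is not created until something writes to it
tmp_csv <- tempfile(fileext = ".csv")
print(tmp_csv)
print(identical(normalizePath(dirname(tmp_csv)), normalizePath(tmp_dir)))
```

The temporary directory is normally removed when the R session ends, so anything saved there should be treated as disposable.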
