
Reading the contents of a gzip file from AWS S3 in R

I'm trying to read my gzipped CSV file from S3.

I already have a list of my object keys, like this:

> MyKeys
[1] "2020/07/25/21/0001_part_00.gz" "2020/07/25/22/0000_part_00.gz" "2020/07/25/22/0001_part_00.gz" "2020/07/25/23/0000_part_00.gz" "2020/07/25/23/0001_part_00.gz"

Using

x <- get_object(MyKeys[1], bucket = bucket)

it returns

str(x)
 raw [1:42017043] 1f 8b 08 00 ...

I tried to use

rawToChar(x)
gunzip(x, remove=FALSE)
read.table(rawConnection(get_object(MyKeys[1], bucket = bucket)))
read_delim(gzfile(get_object(touse[1], bucket = bucket)), ",", escape_double = FALSE, trim_ws = TRUE)

and a few more tricks that I don't remember.

None of this worked. I'm lost here.

Well, in the end I managed to find a solution.

library(aws.s3)    # get_object()
library(readr)     # read_delim()
library(magrittr)  # %>%

df <- get_object(key, bucket = bucket) %>%
        rawConnection %>%
        gzcon %>%
        read_delim("|", escape_double = FALSE, trim_ws = TRUE, col_names = FALSE)
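For anyone who wants to try the rawConnection -> gzcon chain without S3 access, here is a minimal base-R sketch. The sample data and the local temp file are assumptions standing in for the raw vector that get_object returns, and it parses with readLines plus read.table rather than readr, so it is an illustration of the idea and not the exact pipeline above.

```r
# Stand-in for get_object(): gzip a small pipe-delimited file locally and
# read its bytes back as a raw vector (hypothetical sample data).
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, open = "wt")
writeLines(c("1|a", "2|b", "3|c"), gz)
close(gz)
raw_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# A gzip stream starts with the magic bytes 1f 8b, as in the str() output.
is_gzip <- identical(raw_bytes[1:2], as.raw(c(0x1f, 0x8b)))

# Wrap the bytes in a connection, then decompress on read with gzcon().
con <- gzcon(rawConnection(raw_bytes))
txt <- readLines(con)
close(con)  # releases the connection slot and its buffer

# Base read.table() needs text input, so parse the decompressed lines.
df <- read.table(text = txt, sep = "|", header = FALSE,
                 stringsAsFactors = FALSE)
```

Keeping the connection in a variable (con here) is what makes the explicit close() possible.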

Explaining a bit, for anyone who finds themselves in this kind of trouble:

get_object is the main S3 retrieval function; it returns the object as a raw vector. rawConnection wraps that raw vector in a connection, and gzcon wraps the connection so that the gzip data is decompressed as it is read (gzip is a compressed byte stream, so it has to be read through a connection and cannot simply be converted with rawToChar). Finally comes read_delim, which is no mystery to anyone. And it is legen... wait for it... there is a trick here. When you use rawConnection, R internally allocates a vector for your file, and it STAYS there until you close the connection. Usually you create the connection object and then close it, like

x<- rawConnection(<args>)
close(x)

but in this case it's created on the fly using magrittr's '%>%', so I don't have a reference to it.
If you are doing the same as I am, reading months of data from thousands of files in a loop, you will receive the error message

All the connections are in use

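Worry not. R allows only 128 open connections at a time, tops, and every unclosed rawConnection takes up one of them. So store each result in a local file or variable and then call the "garbage collector method" closeAllConnections(), which closes every open connection and frees the vectors held by the leftover raw connections.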
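As an alternative to closeAllConnections(), each connection can be closed as soon as its file has been read, so a long loop never builds up open connections. The sketch below assumes locally generated gzip payloads in place of get_object results; make_gz_bytes is a hypothetical helper invented for the example, and parsing again uses base R rather than readr.

```r
# Hypothetical helper: build gzip bytes locally, standing in for
# get_object(key, bucket = bucket).
make_gz_bytes <- function(lines) {
  tmp <- tempfile(fileext = ".gz")
  on.exit(unlink(tmp), add = TRUE)
  gz <- gzfile(tmp, open = "wt")
  writeLines(lines, gz)
  close(gz)
  readBin(tmp, what = "raw", n = file.size(tmp))
}

# More payloads than the 128-connection limit, to show the loop survives.
payloads <- lapply(1:200, function(i) make_gz_bytes(paste(i, "x", sep = "|")))

open_before <- nrow(showConnections())

parts <- lapply(payloads, function(bytes) {
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con), add = TRUE)  # closed even if parsing fails
  read.table(text = readLines(con), sep = "|", header = FALSE,
             stringsAsFactors = FALSE)
})

all_rows <- do.call(rbind, parts)
open_after <- nrow(showConnections())
```

Because on.exit closes each connection when the anonymous function returns, the number of open connections after the loop matches the number before it. Unlike closeAllConnections(), this leaves any other connections the session has open untouched.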
