
How can I read a csv.gz file with pyarrow from a file object?

I am trying to read a bunch of gzip-compressed CSV files from S3 using pyarrow. The documentation for pyarrow.csv.read_csv says:

If a string or path, and if it ends with a recognized compressed file extension (e.g. ".gz" or ".bz2"), the data is automatically decompressed when reading.

Unfortunately, I cannot provide a string path as the input, so the CSV reader assumes the data is uncompressed.

import s3fs
import pyarrow.csv as pv

s3 = s3fs.core.S3FileSystem(anon=False)

csv_path = 's3://bucket_name/path/to/file.csv.gz'

with s3.open(csv_path) as s3fp:
    # read_csv receives the raw gzip bytes and fails to parse them as CSV
    table = pv.read_csv(s3fp)

I dug into the pyarrow internals but could not find a way to pass the compression type as an additional argument.

Found a workaround: it is possible to insert a gzip decompression step between the S3 file handle and the CSV reader:

import gzip
import s3fs
import pyarrow.csv as pv

s3 = s3fs.core.S3FileSystem(anon=False)

csv_path = 's3://bucket_name/path/to/file.csv.gz'

with s3.open(csv_path) as s3fp:
    # gzip.open wraps the S3 file object and decompresses on the fly,
    # so read_csv sees plain CSV bytes
    with gzip.open(s3fp) as fp:
        table = pv.read_csv(fp)
