How do I read a csv.gz file with pyarrow from a file object?

Question

I am trying to read a bunch of gzip-compressed CSV files from S3 using pyarrow. The documentation page of pyarrow.csv.read_csv says:

If a string or path, and if it ends with a recognized compressed file extension (e.g. “.gz” or “.bz2”

Unfortunately, I have to pass a file object rather than a string path, so the CSV reader assumes no compression.

import s3fs
import pyarrow.csv as pv

s3 = s3fs.core.S3FileSystem(anon=False)

csv_path = 's3://bucket_name/path/to/file.csv.gz'

with s3.open(csv_path) as s3fp:
    table = pv.read_csv(s3fp)

I tried to dig deeper into the pyarrow internals, but I could not find a way to pass an additional argument for the compression type.

Answer 1

Found a workaround. You can wrap the S3 file handle in a gzip decompression layer before passing it to the CSV reader:

import gzip
import s3fs
import pyarrow.csv as pv

s3 = s3fs.core.S3FileSystem(anon=False)

csv_path = 's3://bucket_name/path/to/file.csv.gz'

with s3.open(csv_path) as s3fp:
    with gzip.open(s3fp) as fp:
        table = pv.read_csv(fp)

Answer 1: accepted, 2020-10-29 14:07:14
