Reading only the header of a CSV file in a GCS bucket with Python

Question

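I want to extract the header from a CSV file stored in Google Cloud Storage (GCS). I can extract the header, but the CSV file is larger than 20 GB.

I used a library for this. It does extract the header, but it needs a lot of memory.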

import re

import gcsfs

fs = gcsfs.GCSFileSystem(project=PROJECT)
with fs.open(f'{bucket}/{file}', 'rb') as f:
    # f.read() with no argument loads the whole object into memory
    schema = f.read().decode("utf-8")
    # Remove everything after the first newline
    schema = re.sub("(\\n).*", "", schema)

I also tried this command, but it returned nothing:

fs.read_block('gs://my-bucket/my-file.txt', offset=1000, length=10, delimiter=b'\n')

My question is how to read only the header instead of the whole file.

Answer 1

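schema = f.read()

This reads the entire file. Presumably, if the file object returned by gcsfs.GCSFileSystem.open behaves like the one returned by the built-in open, its read method should accept an integer argument specifying the number of bytes to read.

For example, if the header is 100 bytes long, try: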

schema = f.read(100)
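
If the header length is not known in advance, a fixed read(100) can cut the header short or run into the first data row. One possible workaround, sketched below, is to read small chunks until the first \n appears. The helper name read_first_line_chunked and its chunk_size and max_bytes parameters are illustrative assumptions, not part of gcsfs. The function only relies on read(n), so it should work with any binary file object, including one opened with fs.open(..., 'rb').

```python
import io


def read_first_line_chunked(fileobj, chunk_size=1024, max_bytes=1024 * 1024):
    """Read from a binary file object until the first newline is seen.

    max_bytes is a safety cap (an assumption of this sketch) so that a file
    without any newline is never read in full.
    """
    buffer = bytearray()
    while len(buffer) < max_bytes:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # end of file reached before any newline
            break
        newline_at = chunk.find(b"\n")
        if newline_at != -1:
            # Keep only the bytes before the newline and stop reading.
            buffer += chunk[:newline_at]
            break
        buffer += chunk
    # Drop a trailing carriage return from Windows-style line endings.
    return bytes(buffer).rstrip(b"\r").decode("utf-8")


# Demonstration with an in-memory file standing in for the GCS object.
demo = io.BytesIO(b"id,name,created_at\r\n1,alice,2021-05-24\n")
header = read_first_line_chunked(demo, chunk_size=8)
```

Here header holds the first line without its line ending, and only as many chunks as needed were read from the file object.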

Or, if the header is the first line of the file, terminated by a \n character, try:

schema = f.readline()
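
Putting the readline() suggestion together with the code from the question, a minimal sketch might look like the following. It assumes the header is the first line, the file is UTF-8 encoded, and no column name contains an embedded newline. The function names parse_header_line and read_gcs_csv_header are hypothetical, and gcsfs is imported lazily so the parsing part can be tried without a bucket.

```python
import csv
import io


def parse_header_line(raw_line, encoding="utf-8"):
    """Turn the raw first line of a CSV file into a list of column names."""
    text = raw_line.decode(encoding).rstrip("\r\n")
    # csv.reader copes with quoted column names that contain commas.
    return next(csv.reader([text]), [])


def read_gcs_csv_header(path, project=None):
    """Read only the first line of a CSV object in GCS (requires gcsfs)."""
    import gcsfs  # lazy import: only needed when talking to a real bucket

    fs = gcsfs.GCSFileSystem(project=project)
    with fs.open(path, "rb") as f:
        # readline() stops at the first newline instead of reading everything.
        return parse_header_line(f.readline())


# Local demonstration with an in-memory file instead of a real bucket.
sample = io.BytesIO(b'id,"last, first",created_at\r\n1,"Doe, Jane",2021-05-24\n')
columns = parse_header_line(sample.readline())
```

With a real bucket the call would be something like read_gcs_csv_header(f'{bucket}/{file}', project=PROJECT), using the same placeholder names as the question.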

Solution 1
1 2021-05-24 19:28:01
