How to read a Parquet file's metadata from IBM Cloud Object Storage in Python?

How can I read a Parquet file's metadata (column names with types) from IBM COS in Python?

The only way I have found:

    import pyarrow.parquet as pq
    import s3fs

    # Connect to IBM COS through its S3-compatible endpoint
    s3 = s3fs.S3FileSystem(
        anon=False, key='xxx', secret='xxx',
        client_kwargs={'endpoint_url':
                       "https://s3-api.us-geo.objectstorage.softlayer.net"})

    # .read() loads the data into an Arrow table before .schema is taken
    schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).read().schema

But it reads the whole file (I think).

Maybe there is another approach to get the metadata from a Parquet file located in IBM COS?

If I use

       schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).schema

It returns different data types: BYTE_ARRAY for strings and INT96 for timestamps.

Strange...

Solved:

    schema = pq.ParquetDataset(bucket, filesystem=s3).schema.to_arrow_schema()
