How to read a Parquet file's metadata from IBM Cloud Object Storage using Python?

Question

How can I read a Parquet file's metadata (column names with types) from IBM COS in Python?

The only way I have found:

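           import pyarrow.parquet as pq
           import s3fs

           # Credentials and endpoint for the IBM COS S3-compatible API
           s3 = s3fs.S3FileSystem(anon=False, key='xxx', secret='xxx',
                   client_kwargs={'endpoint_url':
                                      "https://s3-api.us-geo.objectstorage.softlayer.net"})

           # .read() loads the table data; the schema is then taken from the loaded table
           schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).read().schema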

But it reads the whole file (I think).

Maybe there is another approach to get the metadata from a Parquet file located in IBM COS?

If I use

       schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).schema

It returns different data types: BYTE_ARRAY for strings and INT96 for timestamps.

Strange...

Answer 1

Solved:

       schema = pq.ParquetDataset(bucket, filesystem=s3).schema.to_arrow_schema()

Solution 1
0 2018-10-16 15:58:17