
How to read Parquet file's metadata from IBM Cloud Object Storage in Python?

How to read a Parquet file's metadata (column names with types) from IBM COS in Python?

The only way I have found:

           import pyarrow.parquet as pq
           import s3fs

           # Connect to IBM COS through its S3-compatible endpoint
           s3 = s3fs.S3FileSystem(anon=False, key='xxx', secret='xxx',
                   client_kwargs={'endpoint_url':
                                      "https://s3-api.us-geo.objectstorage.softlayer.net"})

           # .read() loads the data into a table before the schema is taken
           schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).read().schema

But it reads the whole file (I think).

Maybe there is another approach to get the metadata from a Parquet file located in IBM COS?

If I use

       schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).schema

it returns different data types: BYTE_ARRAY for strings and INT96 for timestamps.

Strange...

Solved:

schema = pq.ParquetDataset(bucket, filesystem=s3).schema.to_arrow_schema()
