How to read a Parquet file's metadata from IBM Cloud Object Storage in Python?
How can I read a Parquet file's metadata (column names with types) from IBM COS in Python?
The only way I have found:
import pyarrow.parquet as pq
import s3fs

# Connect to IBM COS through its S3-compatible endpoint
s3 = s3fs.S3FileSystem(
    anon=False, key='xxx', secret='xxx',
    client_kwargs={'endpoint_url':
                   "https://s3-api.us-geo.objectstorage.softlayer.net"})
# .read() loads the data into a table before the schema is taken
schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).read().schema
But it reads the whole file (I think).
Maybe there is another approach to get the metadata from a Parquet file located in IBM COS?
If I use
schema = pq.ParquetDataset("bucket_name/file", filesystem=s3).schema
it returns different data types. For strings: BYTE_ARRAY, and for timestamps: INT96.
Strange...
Solved:
schema = pq.ParquetDataset(bucket, filesystem=s3).schema.to_arrow_schema()