Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:

aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet

And then use:

parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet

But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to pass just the prefix: "s3://bucket/my-data.parquet".
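As a partial workaround for the "don't want to name a file" half of this, the prefix can be listed and one part file picked automatically. This is only a sketch: the helper name `first_parquet_key` is hypothetical, it assumes the usual four-column output of `aws s3 ls --recursive` (date, time, size, key), and it still downloads one part file.

```shell
# Hypothetical helper: read `aws s3 ls --recursive` output on stdin
# (columns: date, time, size, key) and print the first key ending in .parquet.
# Keys containing spaces are not handled in this sketch.
first_parquet_key() {
  awk '$4 ~ /\.parquet$/ { print $4; exit }'
}

# Usage (assumes the AWS CLI is configured; bucket and prefix are the ones above):
#   key=$(aws s3 ls s3://bucket/my-data.parquet/ --recursive | first_parquet_key)
#   aws s3 cp "s3://bucket/$key" /tmp/sample.parquet
#   parquet-tools schema /tmp/sample.parquet
```

With `--recursive`, the listed key is relative to the bucket, which is why the copy step prepends only `s3://bucket/`.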

Also, what if the schema differs between row groups across different partition files?
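On the schema question: a Parquet file stores a single schema in its footer, so row groups within one file share it, and any differences would show up between files. A rough way to check for that, assuming the part files are already available locally, is to fingerprint the output of `parquet-tools schema` for each file and count the distinct fingerprints. The function names and the `SCHEMA_CMD` override below are hypothetical, made up for this sketch.

```shell
# SCHEMA_CMD is a hypothetical override hook; by default it runs the
# `parquet-tools schema` command shown earlier.
SCHEMA_CMD="${SCHEMA_CMD:-parquet-tools schema}"

# Print "<checksum> <file>" for each file, so that files with an identical
# schema dump share the same checksum.
schema_fingerprints() {
  for f in "$@"; do
    # $SCHEMA_CMD is left unquoted on purpose so it splits into command + args.
    sum=$($SCHEMA_CMD "$f" | cksum | cut -d' ' -f1)
    printf '%s %s\n' "$sum" "$f"
  done
}

# Print how many distinct schema dumps appear among the given files.
count_distinct_schemas() {
  schema_fingerprints "$@" | cut -d' ' -f1 | sort -u | wc -l | tr -d ' '
}

# Usage:
#   count_distinct_schemas ./my-data-*.snappy.parquet
```

A result of 1 suggests every part file reports the same schema; anything higher means `schema_fingerprints` is worth running to see which files disagree.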

Following the instructions here, I downloaded a jar file and ran:

hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/

But this resulted in the error: No FileSystem for scheme "s3".

This answer seems promising, but only for reading from HDFS. Is there any solution for S3?
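One thing that may be worth trying for the Hadoop route, offered as a sketch rather than a confirmed fix: Hadoop's S3 connector (the `hadoop-aws` module) registers the `s3a://` scheme rather than `s3://`, which would explain the missing-FileSystem error. Assuming `hadoop-aws` and its AWS SDK dependency are on the Hadoop classpath and credentials are configured, rewriting the URI might be enough. The helper name `to_s3a` is hypothetical.

```shell
# Hypothetical helper: rewrite an s3:// URI to the s3a:// scheme used by
# Hadoop's S3A connector. Any other URI is passed through unchanged.
to_s3a() {
  case "$1" in
    s3://*) printf 's3a://%s\n' "${1#s3://}" ;;
    *)      printf '%s\n' "$1" ;;
  esac
}

# Usage (assumes hadoop-aws is on the classpath and AWS credentials are set):
#   hadoop jar parquet-tools-1.9.0.jar schema \
#     "$(to_s3a s3://bucket/my-data.parquet/my-data-0000.snappy.parquet)"
```

Whether this version of parquet-tools accepts a bare prefix instead of a single file is not something this sketch settles, so the usage line points at one part file.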

I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.

You should be able to do:

pip install "clidb[extras]"
clidb s3://bucket/

and then click to load parquet files as views that you can inspect and run SQL against.
