Inspecting Parquet in S3 from the command line

Question

I can download a single snappy.parquet partition file with:

aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet

And then use:

parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet

But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead I'd like to point at the prefix: "s3://bucket/my-data.parquet"

Also, what if the schema differs between row groups across different partition files?

Following instructions here, I downloaded a jar file and ran:

hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/

But this resulted in the error: No FileSystem for scheme "s3".

This answer seems promising, but only for reading from HDFS. Is there any solution for S3?

Answer 1

I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.

You should be able to do:

pip install "clidb[extras]"
clidb s3://bucket/

and then click to load parquet files as views to inspect and run SQL against.

Answer 1: 2022-04-10 22:38:44