How to create an external table for nested Parquet types in Redshift Spectrum

Question

I know Redshift and Redshift Spectrum don't support nested types, but is there any trick to bypass that limitation and query our nested data in S3 with Redshift Spectrum? In this post the author shows how to do it for JSON files, but it's not the same for Parquet. Is there any other trick that can be applied to Parquet files?

The actual schema looks like this (extracted by an AWS Glue crawler):

CREATE EXTERNAL TABLE `parquet_nested`(
  `event_time` string, 
  `event_id` string, 
  `user` struct<ip_address:string,id:string,country:string>, 
  `device` struct<platform:string,device_id:string,user_agent:string>
  )
PARTITIONED BY ( 
  `partition_0` string, 
  `partition_1` string, 
  `partition_2` string, 
  `partition_3` string, 
  `partition_4` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'

Answer 1

@Am1rr3zA Redshift Spectrum now supports querying nested data sets. It supports not only JSON but also columnar formats such as Parquet and ORC. Here is the reference sample from AWS.

I have created external tables pointing to Parquet files in my S3 bucket, so it is possible.
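An external table in Redshift Spectrum lives inside an external schema, so a schema named spectrum has to exist before the table can be created. A minimal sketch, assuming the AWS Glue Data Catalog is used; the database name and the IAM role ARN are hypothetical placeholders, not values from the question:

```sql
-- Hypothetical names: replace 'spectrum_db' and the role ARN with your own.
-- The role must be allowed to read the S3 location and the data catalog.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```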

Give this script a try:

CREATE EXTERNAL TABLE spectrum.parquet_nested (
    event_time varchar(20),
    event_id varchar(20),
    user struct<ip_address:varchar(20),id:varchar(20),country:varchar(20)>,
    device struct<platform:varchar(20),device_id:varchar(20),user_agent:varchar(20)>
)
STORED AS PARQUET
LOCATION 's3://BUCKETNAME/parquetFolder/';
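Once the table exists, struct fields can be read with dot notation through a table alias. A small sketch of such a query against the table defined above; the filter value 'ios' is a made-up example, not data from the question:

```sql
-- Nested struct fields are addressed as alias.column.field.
-- 'ios' is a hypothetical platform value used only for illustration.
SELECT t.event_id,
       t.event_time,
       t.device.platform,
       t.device.device_id
FROM spectrum.parquet_nested t
WHERE t.device.platform = 'ios';
```

The alias (t here) is what lets Spectrum resolve t.device.platform as a path into the struct rather than as a schema-qualified name.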

Hope this saves you the hunt for a trick :)

Solution 1
1 2019-02-12 20:03:01
