
How to create an external table for nested Parquet type in Redshift Spectrum

I know Redshift and Redshift Spectrum don't support nested types, but is there any trick to bypass that limitation and query our nested data in S3 with Redshift Spectrum? In this post the author shows how to do it for JSON files, but it's not the same for Parquet. Is there any other trick that can be applied to Parquet files?

The actual schema is something like this (extracted by an AWS Glue crawler):

CREATE EXTERNAL TABLE `parquet_nested`(
  `event_time` string, 
  `event_id` string, 
  `user` struct<ip_address:string,id:string,country:string>, 
  `device` struct<platform:string,device_id:string,user_agent:string>
  )
PARTITIONED BY ( 
  `partition_0` string, 
  `partition_1` string, 
  `partition_2` string, 
  `partition_3` string, 
  `partition_4` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'

@Am1rr3zA Redshift Spectrum now supports querying nested data. It supports not only JSON but also columnar formats such as Parquet and ORC. Here is the reference sample from AWS.

I have created external tables pointing to Parquet files in my S3 bucket, so it is possible.
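
The script further down creates the table in a schema named spectrum, so that external schema has to exist first. A minimal sketch, assuming a Glue Data Catalog database called spectrumdb and an IAM role ARN that are both placeholders, not values from the question:

```sql
-- Hypothetical names: replace 'spectrumdb' and the role ARN with your own.
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

If the schema is pointed at the same Glue database the crawler wrote to, the crawled table should already be visible under it, which may make a separate CREATE EXTERNAL TABLE unnecessary.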

Give this script a try:

CREATE EXTERNAL TABLE spectrum.parquet_nested (
    event_time varchar(20),
    event_id   varchar(20),
    -- USER is a reserved word in Redshift, so the column name is double-quoted
    "user"     struct<ip_address:varchar(20),id:varchar(20),country:varchar(20)>,
    device     struct<platform:varchar(20),device_id:varchar(20),user_agent:varchar(20)>
)
STORED AS PARQUET
LOCATION 's3://BUCKETNAME/parquetFolder/';
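
Once the table exists, struct fields can be read with dot notation through a table alias. A short sketch against the device column; the alias p and the filter value 'ios' are made up for illustration:

```sql
-- Read scalar columns and nested struct fields in one query.
-- The table alias (p) is what the dot-notation paths hang off.
SELECT p.event_id,
       p.event_time,
       p.device.platform,
       p.device.device_id
FROM spectrum.parquet_nested p
WHERE p.device.platform = 'ios';  -- hypothetical value
```

Note that every field here is declared as varchar(20); values such as user_agent are often longer, so the declared lengths may need to be raised to fit the data.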

Hope this saves you from hunting for a trick :)

