Amazon Redshift Spectrum table with PARTITIONED BY does not return results

Given an S3 bucket partitioned by date in this way:

year
|___month
    |___day
        |___file_*.parquet

I am trying to create a table in Amazon Redshift Spectrum with this command:

create external table spectrum.visits(
  ip varchar(100),
  user_agent varchar(2000),
  url varchar(10000),
  referer varchar(10000),
  session_id char(32),
  store_id int,
  category_id int,
  page_id int,
  product_id int,
  customer_id int,
  hour int
)
partitioned by (year char(4), month varchar(2), day varchar(2))
stored as parquet
location 's3://visits/visits-parquet/';

No error message is thrown, but the queries always come back empty, i.e., they return no rows. The bucket is not empty. Does anyone know what I am doing wrong?

When an external table is created in Amazon Redshift Spectrum, it does not scan for existing partitions. Therefore, Redshift is not aware that they exist.

You will need to execute an ALTER TABLE ... ADD PARTITION command for each existing partition.
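Writing one statement per day by hand gets tedious, so here is a minimal sketch that generates them. It assumes the folders are zero-padded (for example 2019/01/30/) and uses a made-up date range; neither detail is in the question, so adjust both to match the bucket. The function name is hypothetical too.

```python
from datetime import date, timedelta

# Table and base location are taken from the question.
TABLE = "spectrum.visits"
BASE_LOCATION = "s3://visits/visits-parquet/"


def add_partition_statements(start, end):
    """Yield one ALTER TABLE ... ADD PARTITION statement per day in [start, end]."""
    current = start
    while current <= end:
        # Assumption: folder names are zero-padded year/month/day.
        year, month, day = f"{current:%Y}", f"{current:%m}", f"{current:%d}"
        yield (
            f"alter table {TABLE} add if not exists "
            f"partition (year='{year}', month='{month}', day='{day}') "
            f"location '{BASE_LOCATION}{year}/{month}/{day}/';"
        )
        current += timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical date range, for illustration only.
    for statement in add_partition_statements(date(2019, 1, 30), date(2019, 2, 1)):
        print(statement)
```

Run the printed statements in Redshift. Afterwards, querying the SVV_EXTERNAL_PARTITIONS system view should show which partitions the table now knows about.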

(Amazon Athena has an MSCK REPAIR TABLE option, but Redshift Spectrum does not.)

As I can't comment on other people's solutions, I need to add another answer.

I would like to point out that if your Spectrum table comes from the AWS Glue Data Catalog, you don't need to add partitions to the table manually. You can have a crawler update the partitions in the Data Catalog, and the changes will be reflected in Spectrum.

You can create the external table in Athena and run MSCK REPAIR TABLE on it. Make sure you add "/" at the end of the location. Then create an external schema in Redshift. This solved my problem of the results coming back blank. Alternatively, you can run a Glue crawler on the Athena database, which will generate the partitions automatically.
