
How to create a Redshift table using Glue Data Catalog

I'm developing an ETL pipeline using AWS Glue. I have a csv file that is transformed in several ways with PySpark, such as duplicating a column, changing data types, and adding new columns. I ran a crawler against the data store in the S3 location, and it created a Glue table from the given csv file. That is, when I add a new column to the csv file, the crawler changes the Glue table accordingly on its next run.
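For illustration, here is a minimal sketch of the kind of transformations involved (the S3 paths and column names below are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-etl").getOrCreate()

# Read the csv that the crawler later catalogs (placeholder path)
df = spark.read.csv("s3://my-bucket/input/data.csv", header=True, inferSchema=True)

df = (df
      .withColumn("amount", F.col("amount").cast("double"))  # change a data type
      .withColumn("amount_copy", F.col("amount"))            # duplicate a column
      .withColumn("load_date", F.current_date()))            # add a new column

df.write.mode("overwrite").csv("s3://my-bucket/output/", header=True)
```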

Now I want to do the same with Amazon Redshift: create a table in Redshift that mirrors the Glue table I mentioned earlier (created from the csv). A lot of answers explain how to create Redshift schemas manually. I did that, but whenever a data type changes I have to update the table by hand. When the csv file changes, the Redshift table must be updated accordingly.

Can I do the same using crawlers? That is, create a Redshift table that mirrors the Glue Catalog table, so that when a data type changes or a column is removed or added in the csv file, I can just run a crawler. Can this be done with a crawler, or is there another method that fulfills my need? This should be a fully automated ETL pipeline.

Any help would be greatly appreciated!

Answering all of your questions properly is a big task. What I recommend is to get the concepts right for every piece of the puzzle you want to put together.

The csv files apparently give you flexibility that you will not get in Redshift. That is because csv columns aren't really typed; everything is just text... and it is very slow. I would recommend you use parquet files instead.
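If you go that route, converting is straightforward in Spark. A minimal sketch, assuming the same csv source as above (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the untyped csv once, then persist it with a real, typed schema
df = spark.read.csv("s3://my-bucket/input/data.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/data/")
```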

Regarding Redshift: if your table isn't there, you can just use Spark to write it and the table will be created, BUT... you will not be able to set DISTKEY or SORTKEY, so this approach is normally used for temp tables. If you have an additional column, you don't need to create it manually; Spark will do it. Changing a column's data type, however, is not simple, and you will not achieve it (easily) via ETL.
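A minimal sketch of such a write from an AWS Glue job, assuming a Glue JDBC connection to your cluster already exists (the connection, database, table, and temp-dir names below are placeholders):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())

# Reuse the parquet output from the earlier sketch (placeholder path)
df = glue_context.spark_session.read.parquet("s3://my-bucket/parquet/data/")
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Loads into Redshift via COPY; creates the table if it does not exist,
# but with no control over DISTKEY/SORTKEY
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",   # placeholder Glue connection name
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",
)
```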

Finally, the Data Catalog: it is just a schema, the metadata. Normally you use a table to create the metadata, not the metadata to create a table.
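That said, since the catalog metadata is available through the API, you could generate Redshift DDL from it yourself. A rough sketch, with a deliberately simplified type mapping and placeholder database/table names:

```python
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]

# Very simplified Glue-to-Redshift type mapping; extend as needed
type_map = {"string": "VARCHAR(256)", "int": "INTEGER", "bigint": "BIGINT",
            "double": "DOUBLE PRECISION", "boolean": "BOOLEAN",
            "date": "DATE", "timestamp": "TIMESTAMP"}

cols = ",\n  ".join(
    '"{}" {}'.format(c["Name"], type_map.get(c["Type"], "VARCHAR(256)"))
    for c in table["StorageDescriptor"]["Columns"])

ddl = 'CREATE TABLE IF NOT EXISTS public."{}" (\n  {}\n);'.format(
    table["Name"], cols)
print(ddl)  # run this against Redshift yourself, e.g. via a SQL client
```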
