
AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or allow Glue to crawl your S3 bucket, in order to use these data formats within your data lake?

I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.

Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.

Would using these files require running Glue on the data that is submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?

Thank you for any light you can shed on this topic.

-Guido

I'm not seeing anything that can take Excel data directly into the data lake. You might need to convert it into CSV/TSV/JSON or another supported format before loading it.
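One way to do that conversion is a small script run before (or instead of) the submission step. This is a minimal sketch, assuming pandas with openpyxl is available and a boto3 S3 client is configured; the bucket name, the `converted/` prefix, and the helper names are illustrative, not part of any AWS API:

```python
import os


def csv_key_for(xlsx_path, prefix="converted"):
    """Map an uploaded .xlsx file name to the S3 key its CSV should land under.
    The 'converted/' prefix is an assumption for this sketch."""
    base = os.path.splitext(os.path.basename(xlsx_path))[0]
    return f"{prefix}/{base}.csv"


def convert_and_upload(xlsx_path, bucket, s3_client):
    """Convert one Excel workbook (first sheet) to CSV and push it to S3.

    Requires pandas + openpyxl; s3_client is a boto3 S3 client.
    """
    import pandas as pd

    df = pd.read_excel(xlsx_path)            # reads the first sheet only
    csv_body = df.to_csv(index=False)        # header row included
    s3_client.put_object(
        Bucket=bucket,
        Key=csv_key_for(xlsx_path),
        Body=csv_body.encode("utf-8"),
    )
```

Once the CSVs are in S3, the formats below apply to them directly.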

Formats supported by Redshift Spectrum:

http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- Again, I don't see Excel listed as of now.

Athena supported file formats:

http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not supported here either.

You need to upload the files to S3 in order to use Athena, Redshift Spectrum, or even Redshift storage itself.

Uploading files to S3:

If you have bigger files, use S3 multipart upload so the upload goes faster and can resume on failure. For even more speed, use S3 Transfer Acceleration.
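In practice you rarely drive multipart upload by hand: boto3's managed transfer switches to multipart automatically once a file crosses a size threshold. A sketch, assuming boto3 is installed and credentials are configured; the 64 MiB threshold and the `multipart_plan` helper are illustrative choices, not AWS defaults:

```python
def multipart_plan(object_size, part_size=64 * 1024 * 1024):
    """How many parts a multipart upload would use for a given object size
    (ceiling division; every upload has at least one part)."""
    return max(1, -(-object_size // part_size))


def upload_large_file(path, bucket, key):
    """Upload via boto3's managed transfer, which uses multipart upload
    automatically for objects above multipart_threshold."""
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MiB
        multipart_chunksize=64 * 1024 * 1024,  # size of each part
        max_concurrency=8,                     # parallel part uploads
    )
    boto3.client("s3").upload_file(path, bucket, key, Config=config)
```

With these settings a 200 MiB file would go up as four parallel parts.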

Querying big data with Athena:

You can create external tables in Athena over S3 locations. Once you create the external tables, use the Athena SQL reference to query your data.

http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
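A sketch of both steps through the Athena API, assuming boto3 is configured; the database, table, column names, and bucket here are made up for illustration:

```python
def athena_ddl(database, table, bucket, prefix):
    """CREATE EXTERNAL TABLE DDL for header-first CSV files under
    s3://bucket/prefix/. The column list is illustrative only."""
    return f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (
        id INT,
        name STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://{bucket}/{prefix}/'
    TBLPROPERTIES ('skip.header.line.count' = '1')
    """


def run_athena_query(sql, output_s3):
    """Submit a statement through the Athena API. Athena is asynchronous:
    this returns a query execution id you poll for completion."""
    import boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```

After the DDL runs once, plain `SELECT` statements submitted the same way query the CSVs in place.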

Querying big data with Redshift Spectrum:

Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results in Redshift.
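In Spectrum the setup is an external schema mapped onto a Glue/Athena data catalog database, run as SQL against the cluster. A sketch that only builds the statement; the schema name, catalog database, and role ARN are placeholders you would substitute:

```python
def spectrum_schema_ddl(schema, glue_database, iam_role_arn):
    """CREATE EXTERNAL SCHEMA statement pointing Redshift Spectrum at an
    existing data catalog database. The role ARN is a placeholder."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{glue_database}' "
        f"IAM_ROLE '{iam_role_arn}'"
    )
```

Once the schema exists, tables defined in the catalog (for example by Athena DDL or a Glue crawler) are queryable as `schema.table` from Redshift.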

Redshift has a lot of commercial client tools; I use SQL Workbench. It is free, open source, and rock solid, and AWS documents how to connect with it.

SQL Workbench: http://www.sql-workbench.net/

Connecting your workbench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html

Copying data to Redshift:

Also, if you want to move the data into Redshift's own storage, you can use the COPY command to pull the data from S3 and load it into Redshift.

Copy command examples:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
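A minimal COPY statement for the converted CSVs, built as a string here; the table, bucket, prefix, and role ARN are placeholders:

```python
def copy_statement(table, bucket, prefix, iam_role_arn):
    """Redshift COPY from an S3 prefix of CSV files with a header row.
    The role ARN is a placeholder for a role with S3 read access."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}/' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS CSV "
        f"IGNOREHEADER 1"
    )
```

COPY loads every object under the prefix in parallel across the cluster's slices, which is why it is preferred over row-by-row INSERTs.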

Redshift cluster size and number of nodes:

Before creating a Redshift cluster, check the node size and number of nodes you need. More nodes let queries run in parallel. One more important factor is how well your data is distributed (distribution key and sort keys).
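Distribution and sort keys are declared in the table DDL. An illustrative example, again as a statement-building sketch; the table and columns are invented for the sake of the example:

```python
def fact_table_ddl(table):
    """Illustrative Redshift table: rows are distributed across nodes by
    customer_id (co-locating joins on that key) and stored sorted by
    sale_date (pruning range scans on dates)."""
    return f"""
    CREATE TABLE {table} (
        id          BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(12, 2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date)
    """
```

Picking a distribution key with even cardinality matters: a skewed key leaves some nodes doing most of the work regardless of cluster size.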

I have had a very good experience with Redshift; getting up to speed might take some time.

Hope it helps.
