
AWS Data Lake Ingest

Original 2017-09-21 19:01:33 9 1 excel/ amazon-web-services/ amazon-s3/ amazon-athena/ data-lake

Do you need to ingest Excel and other proprietary formats using Glue, or can you let Glue crawl your S3 bucket, in order to use these data formats within your data lake?

I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.

Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.

Would using these files require running Glue on the data that is submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?

Thank you for any light you can shed on this topic.

-Guido

1 answer

I don't see anything that can take Excel data directly into the data lake. You might need to convert it into CSV, TSV, JSON, or another supported format before loading it into the data lake.
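As a rough illustration of that conversion step, here is a minimal Python sketch. It assumes the third-party openpyxl package for reading .xlsx files (Access files would need a different reader), and the function names are hypothetical, not part of any AWS tooling.

```python
import csv
import io


def excel_rows(path, sheet_name=None):
    """Yield each row of one worksheet as a tuple of cell values.

    Assumption: the third-party openpyxl package is installed (.xlsx only).
    """
    from openpyxl import load_workbook  # imported lazily: optional dependency

    workbook = load_workbook(path, read_only=True, data_only=True)
    sheet = workbook[sheet_name] if sheet_name else workbook.active
    for row in sheet.iter_rows(values_only=True):
        yield row
    workbook.close()


def rows_to_csv(rows):
    """Serialize an iterable of rows to CSV text.

    The csv module writes None as an empty field and quotes values
    that contain the delimiter.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `rows_to_csv(excel_rows("provider_data.xlsx"))` would return the first worksheet as CSV text, where `provider_data.xlsx` is a made-up file name. The resulting text can then be written to a file and uploaded to S3.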

Formats Supported by Redshift Spectrum:

http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html (again, I don't see Excel listed as of now).

Athena Supported File Formats:

http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html (Excel is not listed as supported here either).

You need to upload the files to S3 whether you use Athena, Redshift Spectrum, or Redshift's own storage.

Uploading Files to S3:

If you have large files, use S3 multipart upload to upload them faster. If you want more speed, use S3 Transfer Acceleration for your uploads.
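A sketch of how such an upload might look with boto3 (assuming boto3 is installed and credentials are configured). The helper names are hypothetical. The 5 MiB minimum part size and the 10,000-part limit are S3's documented multipart limits, and the acceleration option only works if Transfer Acceleration is already enabled on the bucket.

```python
import os

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum size for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts per multipart upload


def choose_part_size(file_size, preferred=8 * MIB):
    """Return a part size that keeps the upload within MAX_PARTS."""
    part_size = max(preferred, MIN_PART_SIZE)
    while file_size > part_size * MAX_PARTS:
        part_size *= 2
    return part_size


def upload_large_file(path, bucket, key, accelerate=False):
    """Upload one file using boto3's managed (multipart-capable) transfer."""
    import boto3  # imported lazily: assumed to be installed
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    part_size = choose_part_size(os.path.getsize(path))
    transfer_config = TransferConfig(
        multipart_threshold=part_size,  # files above this size are split
        multipart_chunksize=part_size,
    )
    # Assumption: Transfer Acceleration is enabled on the bucket
    # whenever accelerate=True is passed.
    client_config = Config(s3={"use_accelerate_endpoint": accelerate})
    s3 = boto3.client("s3", config=client_config)
    s3.upload_file(path, bucket, key, Config=transfer_config)
```

boto3's `upload_file` switches to multipart on its own once a file crosses the threshold, so the explicit part-size helper here is mostly to make the limits visible.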

Querying Big Data with Athena:

You can create external tables in Athena that point at S3 locations. Once the external tables exist, use the Athena SQL reference to query your data.

http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
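To make the external-table step more concrete, here is a sketch that builds a `CREATE EXTERNAL TABLE` statement for comma-delimited files with a header row and submits it through boto3. The function names, database, table, and bucket are placeholders. Identifiers are interpolated without escaping, so this is only suitable for trusted input.

```python
def athena_csv_table_ddl(database, table, columns, s3_location):
    """Build DDL for an Athena external table over delimited text in S3.

    columns is a list of (name, type) pairs. Plain comma-delimited text is
    assumed; files with quoted fields would need a CSV SerDe instead.
    """
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def run_athena_query(sql, database, output_location):
    """Submit a statement to Athena and return its query execution id."""
    import boto3  # imported lazily: assumed to be installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The same `run_athena_query` helper could then be used for ordinary `SELECT` statements against the new table.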

Querying Big Data with Redshift Spectrum:

As with Athena, you can create external tables in Redshift. You can then query those tables and get the results in Redshift.
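In Redshift Spectrum, external tables live in an external schema that maps to a data catalog database. A small sketch of the statement involved, with a hypothetical function name and placeholder schema, database, and role values:

```python
def spectrum_schema_ddl(schema, catalog_database, iam_role_arn):
    """Build a CREATE EXTERNAL SCHEMA statement for Redshift Spectrum.

    Values are interpolated without escaping, so pass trusted input only.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
```

Once the schema exists, tables in the catalog database can be queried from Redshift as `schema.table`.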

Redshift works with many commercial SQL clients; I use SQL Workbench/J. It is free, open source, and rock solid, and the AWS documentation covers connecting it to Redshift.

SQL WorkBench: http://www.sql-workbench.net/

Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html

Copying data to Redshift:

If you want to store the data in Redshift itself, you can use the COPY command to pull the data from S3 and load it into Redshift.

Copy Command Examples:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
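For a CSV file with one header row, the COPY statement might be assembled like this. It is only a sketch: the function name is hypothetical, the table, S3 path, and IAM role are placeholders, and values are interpolated without escaping.

```python
def redshift_copy_sql(table, s3_path, iam_role_arn, ignore_header=1):
    """Build a Redshift COPY statement that loads CSV data from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        f"IGNOREHEADER {ignore_header};"  # skip this many header lines
    )
```

The statement can be run from any SQL client connected to the cluster. The S3 path may be a prefix, in which case COPY loads every object under it.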

Redshift Cluster Size and Number of Nodes:

Before creating a Redshift cluster, work out the node size and number of nodes you need. More nodes let more of each query run in parallel. Another important factor is how well your data is distributed across the nodes (distribution key and sort keys).
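Distribution and sort keys are declared when the table is created. A sketch of such a definition, with a hypothetical function name and made-up table and column names; which columns make good keys depends on your join and filter patterns:

```python
def redshift_table_ddl(table, columns, dist_key, sort_keys):
    """Build a CREATE TABLE statement with a distribution key and sort keys.

    columns is a list of (name, type) pairs; the keys must name columns in it.
    """
    names = [name for name, _ in columns]
    for key in [dist_key, *sort_keys]:
        if key not in names:
            raise ValueError(f"{key!r} is not a column of {table}")
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} (\n  {column_sql}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"  # rows with equal values share a slice
        f"SORTKEY ({', '.join(sort_keys)});"  # on-disk sort order
    )
```

A column that is frequently joined on is a common choice for the distribution key, and a column used in range filters (such as a timestamp) for the sort key.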

I have had a very good experience with Redshift, though getting up to speed might take some time.

Hope it helps.




solution1 3 ACCEPTED 2017-09-21 20:00:47
