Data Lake AWS serverless Amazon S3

I am trying to build a serverless data lake with Amazon Simple Storage Service (Amazon S3) as the primary data store. Ingested data lands in an Amazon S3 bucket that we refer to as the raw zone. To make that data available, I have to catalog its schema in the AWS Glue Data Catalog.

I do this using an AWS Lambda function invoked by an Amazon S3 trigger to start an AWS Glue crawler that catalogs the data.
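
For reference, this is roughly how the S3 trigger can be wired up with the Node.js SDK. A minimal sketch, assuming a placeholder account ID and function name:

var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Invoke the crawler-starting Lambda for every object created in the
// raw bucket. The Lambda ARN below is a placeholder.
var params = {
    Bucket: 'rede-data-lake-raws3bucket-1qgllh1leebin',
    NotificationConfiguration: {
        LambdaFunctionConfigurations: [{
            Events: ['s3:ObjectCreated:*'],
            LambdaFunctionArn: 'arn:aws:lambda:us-east-2:123456789012:function:startCrawlerFunction'
        }]
    }
};
s3.putBucketNotificationConfiguration(params, function(err, data) {
    if (err) console.log(JSON.stringify(err, null, 3));
    else console.log('S3 trigger configured');
});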

When the crawler has finished creating the table definition, I invoke a second Lambda function using an Amazon CloudWatch Events rule. This step starts an AWS Glue ETL job that processes the data and outputs it into another Amazon S3 bucket that we refer to as the processed zone. The AWS Glue ETL job converts the data to Apache Parquet format and stores it in the processed S3 bucket.

Lambda to run the crawler:

var AWS = require('aws-sdk');
var glue = new AWS.Glue();
var sqs = new AWS.SQS();

exports.handler = function(event, context, callback) {
    console.log(JSON.stringify(event, null, 3));
    // Retry path: the SQS queue re-invokes this function, so just try
    // to start the crawler again.
    if (event.Records.length > 0 && event.Records[0].eventSource === 'aws:sqs') {
        startCrawler('datacrawler', function(err2, data2) {
            if (err2) callback(err2);
            else callback(null, data2);
        });
    } else {
        // S3 path: make sure the catalog database and the crawler exist,
        // then start the crawler. Errors from createDatabase and
        // createCrawler are deliberately ignored, since both may already
        // exist from a previous invocation.
        var dbName = 'datacatalog';
        var params = {
            DatabaseInput: {
                Name: dbName,
                Description: 'Rede Post database'
            }
        };
        glue.createDatabase(params, function(err, data) {
            var params1 = {
                DatabaseName: dbName,
                Name: 'datacrawler',
                Role: 'service-role/rede-data-lake-GlueLabRole-1OI9OXN93676F',
                Targets: {
                    S3Targets: [{ Path: 's3://rede-data-lake-raws3bucket-1qgllh1leebin/' }]
                },
                Description: 'crawler test'
            };
            glue.createCrawler(params1, function(err1, data1) {
                startCrawler('datacrawler', function(err2, data2) {
                    if (err2) callback(err2);
                    else callback(null, data2);
                });
            });
        });
    }
};

function startCrawler(name, callback) {
    var params = {
        Name: name
    };
    glue.startCrawler(params, function(err, data) {
        if (err) {
            // The crawler may still be running; queue a retry message so
            // that the queue re-invokes this function (the aws:sqs branch
            // above) and the start is attempted again.
            console.log(JSON.stringify(err, null, 3));
            var params1 = {
                MessageBody: 'retry',
                QueueUrl: 'https://sqs.us-east-2.amazonaws.com/094381036356/rede-data-lake-SQSqueue-1AWGW0PCYANIY'
            };
            sqs.sendMessage(params1, function(err1, data1) {
                if (err1) callback(err1);
                else callback(null, data1);
            });
        } else {
            callback(null, data);
        }
    });
}

CloudWatch Events rule:

{
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "source": [
    "aws.glue"
  ],
  "detail": {
    "crawlerName": [
      "datacrawler"
    ],
    "state": [
      "Succeeded"
    ]
  }
}
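
The rule only matches the event; the second Lambda still has to be attached to it as a target. A minimal sketch of that step, assuming a placeholder rule name and function ARN:

var AWS = require('aws-sdk');
var events = new AWS.CloudWatchEvents();

// Point the rule above at the Lambda that starts the Glue job.
// 'crawler-succeeded-rule' and the ARN are placeholders.
var params = {
    Rule: 'crawler-succeeded-rule',
    Targets: [{
        Id: 'startGlueJobTarget',
        Arn: 'arn:aws:lambda:us-east-2:123456789012:function:startGlueJobFunction'
    }]
};
events.putTargets(params, function(err, data) {
    if (err) console.log(JSON.stringify(err, null, 3));
    else console.log('target attached');
});

The function also needs a resource-based policy that allows events.amazonaws.com to invoke it.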

Lambda to run the Glue job:

var AWS = require('aws-sdk');
var glue = new AWS.Glue({apiVersion: '2017-03-31'});

exports.handler = function(event, context, callback) {
    console.log(JSON.stringify(event, null, 3));
    // Start the ETL job; the job name is hard-coded for now.
    var params = {
        JobName: 'GlueSalesJob',
        Timeout: 20
    };
    glue.startJobRun(params, function(err, data) {
        if (err) {
            console.log(err, err.stack);
            callback(err);
        } else {
            console.log(data);
            callback(null, data);
        }
    });
};

All of this works just fine when we deal with just one file and one Glue job, but I don't see how to scale it.

Imagine that I have various different files arriving in the raw zone, each file in its own folder. For each one I have to run an AWS Glue crawler and an AWS Glue ETL job, and store the result in a folder inside the processed zone bucket.

Ex: SaleFile, installmentsFile, DebitFiles, etc.

How could I call the second Lambda, passing the name of the job that should run for each file? Basically, I need to identify the file or folder ingested in order to call the appropriate Glue job.

Could someone help me find a solution for this? I appreciate any help. I'm very new to Amazon.

Good going: You are almost there :-)

When you listen for the 'crawler state change' event, you will get an event object in Lambda. It has the following structure:

"detail": {
    "crawlerName": "demo",
    ....
    ....
}

Use event["detail"]["crawlerName"] to get the crawler name. Since each crawler is mapped to a corresponding ETL job, you can start the Glue job by using this mapping.
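
A minimal sketch of the second Lambda with such a mapping; apart from datacrawler and GlueSalesJob, the crawler and job names below are made up for illustration:

var AWS = require('aws-sdk');
var glue = new AWS.Glue({apiVersion: '2017-03-31'});

// One crawler per folder/file type, each mapped to its own ETL job.
// Only datacrawler and GlueSalesJob come from the question; the other
// entries are hypothetical examples.
var crawlerToJob = {
    'datacrawler': 'GlueSalesJob',
    'installmentscrawler': 'GlueInstallmentsJob',
    'debitcrawler': 'GlueDebitJob'
};

exports.handler = function(event, context, callback) {
    var crawlerName = event.detail.crawlerName; // set by the CloudWatch event
    var jobName = crawlerToJob[crawlerName];
    if (!jobName) {
        return callback(new Error('No Glue job mapped for crawler ' + crawlerName));
    }
    glue.startJobRun({ JobName: jobName, Timeout: 20 }, function(err, data) {
        if (err) callback(err);
        else callback(null, data.JobRunId);
    });
};

On the ingestion side you could create one crawler per folder in the same way, deriving the crawler name from the S3 key prefix in the first Lambda.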
