Reading csv from S3 and inserting into a MySQL table with AWS Lambda

I'm trying to automate loading a csv into a MySQL table whenever one arrives in an S3 bucket.

My strategy is that S3 fires an event when it receives a file in a specified bucket (let's call it 'bucket-file'). This event is delivered to an AWS Lambda function that downloads and processes the file, inserting each row into a MySQL table (let's call it 'target_table').

We have to take into consideration that RDS is in a VPC.

The current permission configuration of the bucket is:

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "PublicReadForGetBucketObjects",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::bucket-file/*"
        }
    ]
}

I've created a role with the AmazonS3FullAccess and AWSLambdaVPCAccessExecutionRole policies and attached it to the AWS Lambda function.
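Incidentally, AmazonS3FullAccess is much broader than this flow needs; a minimal sketch of an inline policy scoped to reading from the bucket (assuming the bucket name 'bucket-file' from above) could look like:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::bucket-file/*"
        }
    ]
}
```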

The lambda code is:

from __future__ import print_function
import boto3
import logging
import os
import sys
import uuid
import pymysql
import csv
import rds_config


rds_host  = rds_config.rds_host
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name


logger = logging.getLogger()
logger.setLevel(logging.INFO)

try:
    conn = pymysql.connect(host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)
except Exception as e:
    logger.error("ERROR: Unexpected error: Could not connect to MySql instance.")
    logger.error(e)
    sys.exit()

logger.info("SUCCESS: Connection to RDS mysql instance succeeded")

s3_client = boto3.client('s3')

def handler(event, context):

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    download_path = '/tmp/{}{}'.format(uuid.uuid4(), key)

    s3_client.download_file(bucket, key, download_path)

    # file() is Python 2 only; open() works in both Python 2 and 3
    csv_data = csv.reader(open(download_path))

    with conn.cursor() as cur:
        for idx, row in enumerate(csv_data):

            logger.info(row)
            try:
                # Note the trailing space before 'VALUES' and the unquoted
                # placeholders: pymysql quotes the parameters itself.
                cur.execute('INSERT INTO target_table(column1, column2, column3) '
                            'VALUES(%s, %s, %s)',
                            row)
            except Exception as e:
                logger.error(e)

            if idx % 100 == 0:
                conn.commit()

        conn.commit()

    return 'File loaded into RDS:' + str(download_path)

I've been testing the function: S3 sends the event when a file is uploaded, and Lambda connects to the RDS instance and gets the notification. I've checked that the bucket name is 'bucket-file' and that the filename is right. The problem is that when the function reaches the line s3_client.download_file(bucket, key, download_path) it gets stuck until the Lambda timeout is reached.
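One pitfall worth ruling out separately: the object key in an S3 event notification is URL-encoded, so a key containing spaces or special characters taken straight from the event won't match the actual object name. A small sketch of the usual decoding step (the key shown is hypothetical; in Python 2 the function is urllib.unquote_plus):

```python
from urllib.parse import unquote_plus  # Python 3; urllib.unquote_plus in Python 2

# S3 event notifications URL-encode the object key
# (spaces arrive as '+', other characters as %XX escapes),
# so decode it before passing it to download_file.
raw_key = 'incoming/my+report%202017.csv'  # hypothetical key from an event record
key = unquote_plus(raw_key)
print(key)  # incoming/my report 2017.csv
```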

The logs say:

[INFO]  2017-01-24T14:36:52.102Z    SUCCESS: Connection to RDS mysql instance succeeded
[INFO]  2017-01-24T14:36:53.282Z    Starting new HTTPS connection (1): bucket-files.s3.amazonaws.com
[INFO]  2017-01-24T14:37:23.223Z    Starting new HTTPS connection (2): bucket-files.s3.amazonaws.com
2017-01-24T14:37:48.684Z Task timed out after 60.00 seconds

I've also read that when working inside a VPC, in order to access an S3 bucket you have to create a VPC Endpoint that grants the subnet access to S3. I've tried this solution too, but the result is the same.

I'd appreciate some ideas.

Thanks in advance!

I finally got it!

The problem was indeed the VPC. As I said, I had created a VPC Endpoint to make the S3 service accessible from my VPC, but my route table was wrongly configured.

So, in conclusion: if you are working with Lambda inside a VPC and want to access S3, you need to create a VPC Endpoint. Besides that, if you want to reach any other internet service outside your VPC, you need to configure a NAT Gateway.
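For reference, a Gateway endpoint for S3 can be created with the AWS CLI along these lines (the VPC ID, route table ID, and region are placeholders; the route table must be the one associated with the Lambda function's subnets):

```shell
# Create a Gateway VPC Endpoint for S3 and attach it to the route
# table used by the subnets the Lambda function runs in.
# vpc-xxxxxxxx, rtb-xxxxxxxx and the region are placeholders.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-xxxxxxxx \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-xxxxxxxx
```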
