How to read a CSV file line by line from S3 in an AWS Glue job

Question

Hi, I am very new to AWS.

I am trying to retrieve a 5 GB CSV file that I have stored in an S3 bucket, do ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a plain Python shell job; it does not use Spark.

My problem is that when I try to retrieve the file, I get a file-not-found exception. Here is my code:

import boto3
import logging
import csv
import s3fs

from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'

A few lines down, within my class:

with open(self.csv_file_path, "r") as input:
       csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)

       for row in csv_reader:

The file-not-found error is raised at the with open(...) call, even though the file is there. I really do not want to use pandas; we've had problems working with pandas within Glue. Since this is a 5 GB file, I can't hold it in memory, which is why I'm trying to open it and read it row by row.

I would really appreciate any help with this.

Also, I have the correct IAM permissions for Glue set up.

Answer 1

I figured it out.

You have to use the S3 client from boto3:

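import boto3
import csv

s3 = boto3.client('s3')

# Fetch the object; 'bucket_name' and 'file_name' are placeholders
file = s3.get_object(Bucket='bucket_name', Key='file_name')

# Note: read() loads the whole object into memory before splitting it into lines
lines = file['Body'].read().decode('utf-8').splitlines(True)

csv_reader = csv.reader(lines, delimiter=',', quoting=csv.QUOTE_NONE)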

Then just iterate over csv_reader with a for loop.
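
Because read() in the snippet above pulls the entire object into memory, a 5 GB file may not fit in a Python shell job. Below is a minimal sketch of a streaming variant. It assumes the data is UTF-8 and relies on botocore's StreamingBody.iter_lines(); the helper names iter_csv_rows and read_rows_from_s3 are hypothetical and not part of the original answer.

```python
import csv


def iter_csv_rows(body, delimiter=",", encoding="utf-8"):
    """Yield parsed CSV rows from a streaming body, one line at a time."""
    # iter_lines() yields raw byte lines as they arrive, so the whole
    # object is never held in memory at once.
    decoded = (line.decode(encoding) for line in body.iter_lines())
    yield from csv.reader(decoded, delimiter=delimiter, quoting=csv.QUOTE_NONE)


def read_rows_from_s3(bucket, key, delimiter=","):
    """Stream rows of an S3 object; bucket and key are supplied by the caller."""
    # Imported here so the parsing helper above can be exercised
    # without boto3 or AWS credentials.
    import boto3

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    yield from iter_csv_rows(obj["Body"], delimiter=delimiter)
```

A loop such as for row in read_rows_from_s3('bucket_name', 'file_name', delimiter='^') would then process one row at a time. Splitting on line boundaries is only safe here because quoting is disabled; a file with quoted fields containing newlines would need a different approach.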

Answer 1 (score 3, accepted, 2020-04-07 01:16:35)
