How do I read row by row of a CSV file from S3 in an AWS Glue Job
Hi, I am very new to AWS.
I am trying to retrieve a 5 GB CSV file that I have stored in an S3 bucket, do ETL on it, and load it into a DynamoDB table using AWS Glue. My Glue job is a pure Python shell job, not Spark.
My problem is that when I try to retrieve the file, I get a "file not found" exception. Here is my code:
import boto3
import logging
import csv
import s3fs
from boto3 import client
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError
csv_file_path = 's3://my_s3_bucket/mycsv_file.csv'
A few lines down, within my class:
with open(self.csv_file_path, "r") as input:
    csv_reader = csv.reader(input, delimiter='^', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
The "file not found" error is raised inside the `with open(...)` call, even though the file is there. I really do not want to use pandas; we've had problems working with pandas within Glue. Since this is a 5 GB file I can't hold it in memory, which is why I'm trying to open it and read it row by row.
I would really appreciate help with this. I also have the correct IAM Glue permissions set up.
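For context on the original error: the built-in `open()` only resolves local filesystem paths, so passing an `s3://` URL fails regardless of what IAM permissions are attached to the job. A quick demonstration:

```python
# open() treats 's3://my_s3_bucket/...' as a relative local path;
# since no such local directory exists, it raises FileNotFoundError.
try:
    open('s3://my_s3_bucket/mycsv_file.csv', 'r')
except OSError as exc:  # FileNotFoundError is a subclass of OSError
    print(type(exc).__name__)
```

This is why the error appears even though the object exists in S3: the path never reaches S3 at all.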
I figured it out: you have to use the S3 client from boto3.
s3 = boto3.client('s3')
file = s3.get_object(Bucket='bucket_name', Key='file_name')
# read() pulls the whole object body; splitlines(True) keeps line endings,
# which csv.reader handles fine
lines = file['Body'].read().decode('utf-8').splitlines(True)
csv_reader = csv.reader(lines, delimiter=',', quoting=csv.QUOTE_NONE)
and then just loop over the csv reader.
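Note that `read()` still pulls the entire object into memory at once, which may not work for a 5 GB file. The `StreamingBody` returned by `get_object` also exposes `iter_lines()`, which streams the body in chunks. A minimal sketch (the bucket and key names are placeholders, and the boto3 usage is shown in comments since it requires live AWS credentials):

```python
import csv


def rows_from_body(body, delimiter=','):
    """Yield parsed CSV rows from a botocore StreamingBody (or any object
    with an iter_lines() method) without loading the whole file."""
    # iter_lines() yields raw byte lines chunk by chunk from the stream
    lines = (line.decode('utf-8') for line in body.iter_lines())
    yield from csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)


# Usage against S3 (placeholder names):
# s3 = boto3.client('s3')
# obj = s3.get_object(Bucket='bucket_name', Key='file_name')
# for row in rows_from_body(obj['Body']):
#     ...  # process one row at a time
```

This keeps memory usage bounded to one chunk at a time instead of the full object.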