Reading a CSV file from AWS S3 using boto and pandas

Question

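I have read through the answers available here and here, but none of them helped.

I am trying to read a CSV object from an S3 bucket, and I can read the data successfully with the following code.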

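from boto.s3.connection import S3Connection
from boto.s3.key import Key

srcFileName = "gossips.csv"

def on_session_started():
  print("Starting new session.")
  # With no arguments, credentials come from the environment or the boto config file.
  conn = S3Connection()
  # validate=False skips the check that the bucket exists.
  my_bucket = conn.get_bucket("randomdatagossip", validate=False)
  print("Bucket Identified")
  print(my_bucket)
  key = Key(my_bucket, srcFileName)
  key.open()
  print(key.read())
  conn.close()

on_session_started()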

However, if I try to read the same object as a data frame using pandas, I get errors. The most common one is S3ResponseError: 403 Forbidden

def on_session_started2():
  print("Starting Second new session.")
  conn = S3Connection()
  my_bucket = conn.get_bucket("randomdatagossip", validate=False)
  #     url = "https://s3.amazonaws.com/randomdatagossip/gossips.csv"
  #     urllib2.urlopen(url)

  for line in smart_open.smart_open('s3://my_bucket/gossips.csv'):
     print line
  #     data = pd.read_csv(url)
  #     print(data)

on_session_started2()

What exactly am I doing wrong? I am on Python 2.7 and cannot use Python 3.
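
One detail worth checking in the second snippet: the URL passed to smart_open is the literal string 's3://my_bucket/gossips.csv', so it points at a bucket actually named my_bucket rather than at the randomdatagossip bucket held in the my_bucket variable. That may be one reason for the 403, although the snippet alone does not prove it. Below is a minimal sketch of building the URL from the real bucket name. Note that s3_url is a hypothetical helper introduced here for illustration and is not part of boto or smart_open.

```python
def s3_url(bucket_name, key_name):
    """Build an s3:// URL from a bucket name and an object key (hypothetical helper)."""
    # Strip a leading slash so the URL does not end up with an empty path segment.
    return "s3://{0}/{1}".format(bucket_name, key_name.lstrip("/"))


bucket_name = "randomdatagossip"  # bucket name used in the question
src_file_name = "gossips.csv"     # object key used in the question

url = s3_url(bucket_name, src_file_name)
print(url)
```

The resulting string could then be passed to smart_open.smart_open(url) in place of the hard-coded one.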

Answer 1

Here is what I did to successfully read a DataFrame from a CSV on S3.

import pandas as pd
import boto3

bucket = "yourbucket"
file_name = "your_file.csv"

# 's3' is the service name; the client is created with the default configuration and credentials
s3 = boto3.client('s3')

# fetch the object identified by the bucket and the key (the file name)
obj = s3.get_object(Bucket=bucket, Key=file_name)

# 'Body' is the key in the response that holds the object's content stream
initial_df = pd.read_csv(obj['Body'])

Answer 2

This worked for me.

import pandas as pd
import boto3
import io

s3_file_key = 'data/test.csv'
bucket = 'data-bucket'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=s3_file_key)

initial_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
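
Reading the body into io.BytesIO hands pandas an in-memory, seekable buffer, which tends to be the safer form if passing the streaming body directly raises an error with a particular pandas/botocore combination. The same steps can be wrapped in a small function that takes the client as an argument, so it can be tried without AWS access. This is a sketch under assumptions: read_csv_from_s3 and FakeS3Client are hypothetical names, and the fake client only mimics the get_object response shape used in the answer.

```python
import io

import pandas as pd


def read_csv_from_s3(client, bucket, key, **read_csv_kwargs):
    """Fetch one S3 object and parse it as CSV (hypothetical helper).

    `client` is anything with a boto3-style get_object(Bucket=..., Key=...)
    method, for example boto3.client('s3').
    """
    response = client.get_object(Bucket=bucket, Key=key)
    # Read the whole body into memory so pandas receives a seekable buffer.
    raw_bytes = response["Body"].read()
    return pd.read_csv(io.BytesIO(raw_bytes), **read_csv_kwargs)


class FakeS3Client(object):
    """Hypothetical stand-in for a boto3 client, used only to run the helper offline."""

    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) -> bytes

    def get_object(self, Bucket, Key):
        # Mimic the part of the real response that the helper relies on.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


fake_client = FakeS3Client(
    {("data-bucket", "data/test.csv"): b"name,score\nalice,1\nbob,2\n"}
)
df = read_csv_from_s3(fake_client, "data-bucket", "data/test.csv")
print(df)
```

With real credentials configured, boto3.client('s3') would be passed instead of the fake client. Because the whole object is held in memory, this suits files that fit comfortably in RAM.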

Answer 3

Maybe you could try using pandas read_sql with pyathena:

from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir='s3://bucket/folder',region_name='region')
df = pd.read_sql('select * from database.table', conn)

Answer 1
Score 27, accepted, 2017-05-02 13:12:19

Answer 2
Score 16, 2018-03-07 18:15:38

Answer 3
Score 1, 2020-10-27 00:20:00
