

How to read parquet files from AWS S3 using spark dataframe in python (pyspark)

I'm trying to read some parquet files stored in an S3 bucket. I am using the following code:

s3 = boto3.resource('s3')

# get a handle on the bucket that holds your file
bucket = s3.Bucket('bucket_name')

# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key = 'file/key/083b661babc54dd89139449d15fa22dd.snappy.parquet')

# get the object
response = obj.get()

# read the contents of the file and split it into a list of lines
lines = response[u'Body'].read().split('\n')

When trying to execute the last line of code, lines = response[u'Body'].read().split('\n'), I'm getting the following error:

TypeError: a bytes-like object is required, not 'str'

I'm not really sure how to solve this issue.
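The immediate cause of the error: in Python 3, response['Body'].read() returns bytes, while '\n' is a str, and the two can't be mixed in split(). A minimal sketch of the two usual fixes, using an in-memory bytes payload as a stand-in for the S3 response (note that a parquet file is a binary format, so splitting it into lines still won't yield usable rows; see the Spark-based answer below):

```python
# stand-in for response['Body'].read(), which returns bytes
data = b"line1\nline2\nline3"

# Option 1: split on a bytes delimiter
byte_lines = data.split(b'\n')

# Option 2: decode the bytes to str first, then split on a str delimiter
text_lines = data.decode('utf-8').split('\n')
```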

Instead of boto3 I had to use the following code:

myAccessKey = 'your key'
mySecretKey = 'your key'

import os
# pull in the AWS SDK and the hadoop-aws connector when the shell starts
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# point the s3:// scheme at the native S3 filesystem and pass the
# AWS credentials through the Hadoop configuration
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

# read every parquet file under the prefix into a single DataFrame
df = sqlContext.read.parquet("s3://bucket-name/path/")

