
Writing files using spark and reading using python

Writing a file to S3 using Spark usually creates a directory with 11 files: a _SUCCESS marker plus 10 files whose names start with part-, which contain the actual data. How can that output be loaded into a pandas DataFrame, given that the path changes on every run because the part- names of the 10 data files are different each time?

For example, the path used when writing:

df.coalesce(10).write.parquet("s3://testfolder")

The files stored in the directory are:

- _SUCCESS
- part-00-*.parquet

I have a python job which reads the file into a pandas DataFrame:

pd.read_parquet("s3://..........what is the path to specify here.................")

When writing files with Spark, you cannot pass the name of the file (you can, but you end up with what you described above). If you want a single file to later load into pandas, you would do something like this:

df.repartition(1).write.parquet(path="s3://testfolder/", mode='append')

The end result will be a single file in "s3://testfolder/" whose name starts with part-00-*.parquet. You can simply read that file in, or rename it to something specific before reading it in with pandas.
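For example, here is a minimal sketch of that approach, assuming boto3, pyarrow and s3fs are installed and using the s3://testfolder/ location from above (adjust the bucket and prefix to your own layout):

import boto3
import pandas as pd

s3 = boto3.client('s3')

# List the objects Spark wrote and keep only the data file,
# skipping the _SUCCESS marker.
response = s3.list_objects_v2(Bucket='testfolder')
part_keys = [obj['Key'] for obj in response['Contents']
             if obj['Key'].split('/')[-1].startswith('part-')]

# With repartition(1) there is exactly one part file.
df = pd.read_parquet('s3://testfolder/' + part_keys[0], engine='pyarrow')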

Option 1: (Recommended)

You can use awswrangler. It's a lightweight tool to aid with the integration between Pandas/S3/Parquet, and it lets you read in multiple files from the directory.

pip install awswrangler

import awswrangler as wr

df = wr.s3.read_parquet(path='s3://testfolder/')
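If you only need a subset of columns, or want awswrangler to treat the whole prefix as one dataset of part files, something like the following should work (the column names below are placeholders, not from the original question):

df = wr.s3.read_parquet(
    path='s3://testfolder/',
    dataset=True,               # treat the prefix as a dataset of part files
    columns=['id', 'value']     # placeholder column names; read only what you need
)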

Option 2:

############################## RETRIEVE KEYS FROM THE BUCKET ##################################

import boto3
import pandas as pd

s3 = boto3.client('s3')

s3_bucket_name = 'your bucket name'
prefix = 'path where the files are located'

response = s3.list_objects_v2(
    Bucket = s3_bucket_name, 
    Prefix = prefix 
)

keys = []
for obj in response['Contents']:
    # Keep only the parquet part files; skip the _SUCCESS marker and any folder keys
    if obj['Key'].endswith('.parquet'):
        keys.append(obj['Key'])
    
##################################### READ IN THE FILES  ####################################### 


# Read each parquet file into its own DataFrame
df = []
for key in keys:
    df.append(pd.read_parquet(path = 's3://' + s3_bucket_name + '/' + key, engine = 'pyarrow'))
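The loop above leaves df as a list with one DataFrame per key; to end up with a single DataFrame, concatenate them:

# Combine the per-file DataFrames into one
df = pd.concat(df, ignore_index=True)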
