How to create a new DataFrame and CSV file from subfolders in S3 with PySpark

Question

Hi, I'm very new to PySpark and S3, and I have a problem at hand. I have a folder that contains files and subfolders, and the subfolders contain files as well (all CSVs). I need to create a new DataFrame or a CSV file that combines the contents of all these files into one. That result later needs to be loaded into a table in Postgres.

Can anyone please help me? I have code in Python, but I'm not sure how to go about it with PySpark and S3.

Answer 1

Try this option.

recursiveFileLookup: recursively scans a directory for files. Using this option disables partition discovery.

df = spark.read.option("header","true").option("recursiveFileLookup","true").csv("s3://path/to/root/")
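To cover the rest of the question (one combined CSV, then a Postgres table), here is a minimal sketch of how the pieces could fit together. The function names, the output path, the table name and the JDBC settings are all hypothetical placeholders, not something from the original post. It assumes Spark 3.0 or later (where recursiveFileLookup is available) and that the PostgreSQL JDBC driver jar is on Spark's classpath. Depending on your setup, the path scheme may need to be s3a:// instead of s3://.

```python
# Sketch only: helper names, paths and connection settings are assumptions.

def read_csv_tree(spark, root_path):
    """Read every CSV under root_path, including nested subfolders,
    into a single DataFrame."""
    return (
        spark.read
        .option("header", "true")
        .option("recursiveFileLookup", "true")
        .csv(root_path)
    )


def write_single_csv(df, out_path):
    """Write the DataFrame as one CSV part file.

    coalesce(1) collapses the data to one partition, so Spark writes a
    single part-*.csv file inside the out_path directory (out_path itself
    is a directory, not a file name).
    """
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv(out_path)


def write_to_postgres(df, jdbc_url, table, user, password):
    """Append the DataFrame to a Postgres table through Spark's JDBC writer."""
    (
        df.write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )


# Example usage (placeholder values, needs a running SparkSession):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("merge-csvs").getOrCreate()
#   df = read_csv_tree(spark, "s3://path/to/root/")
#   write_single_csv(df, "s3://path/to/output/")
#   write_to_postgres(df, "jdbc:postgresql://host:5432/dbname",
#                     "public.my_table", "user", "password")
```

If the only goal is the Postgres table, the intermediate CSV step can probably be skipped and the DataFrame written over JDBC directly. Note that coalesce(1) pushes all data through a single task, so it is only reasonable when the combined data is small enough for that.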

Solution 1
1 2020-08-19 07:13:26
