Is there a way to list the directories in a using PySpark in a notebook?

Original post 2020-06-28 07:08:31 5 1 amazon-s3/ pyspark/ apache-spark-sql/ cyberduck

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames and then pull a file, if needed, in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.

Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but they take too long to load the data.

1 answer

Since the list of files can be retrieved by a single node, you can simply list the files in the directory instead of reading them through Spark. Look at this response.

wholeTextFiles returns (path, content) tuples, but I don't know whether the content is loaded lazily, that is, whether you can take only the first element of each tuple without reading the files.
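A minimal sketch of listing names on a single node, without touching file contents. It assumes the data lives in Amazon S3 (the question's tags suggest this, but the "name://" scheme in the question is elided) and that the caller passes a boto3 S3 client. The helper name list_directory and the bucket name "my-bucket" are hypothetical. The calls get_paginator("list_objects_v2") and paginate(Bucket=..., Prefix=..., Delimiter="/") are part of boto3.

```python
def list_directory(client, bucket, prefix):
    """Return (subdirectories, files) directly under `prefix`.

    `client` is assumed to behave like a boto3 S3 client. Only object
    metadata is requested, so no file body is downloaded.
    """
    # Normalise so "mainfolder/date" and "mainfolder/date/" behave the same.
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    subdirs, files = [], []
    paginator = client.get_paginator("list_objects_v2")
    # Delimiter="/" groups deeper keys into CommonPrefixes, so only one
    # level is returned, similar to `ls` on a local directory.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common_prefix in page.get("CommonPrefixes", []):
            subdirs.append(common_prefix["Prefix"])
        for obj in page.get("Contents", []):
            # Skip the zero-byte "folder" placeholder object, if one exists.
            if obj["Key"] != prefix:
                files.append((obj["Key"], obj["Size"]))
    return subdirs, files


# Hypothetical usage in a notebook cell (needs boto3 and valid credentials):
# import boto3
# subdirs, files = list_directory(boto3.client("s3"), "my-bucket", "mainfolder/date")
```

Because this runs on the driver only, it can be called once for "mainfolder/date" and again for each returned subdirectory, and the resulting keys can then be passed to sc.textFile in a later cell for just the files that are needed.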

Related questions

How to perform incremental load using AWS EMR (Pyspark) the right way?

How to list S3 objects in parallel in PySpark using flatMap()?

Is there a way to process a json file fom s3 bucket using pyspark without downloading?

Using pyspark on AWS EMR

Pyspark not using TemporaryAWSCredentialsProvider

How to Create list of filenames in an S3 directory using pyspark and/or databricks utils

S3 Python List nested sub directories

boto3 list ONLY directories on bucket

List directories in amazon S3 with AWS SDK

S3 boto3 list directories only

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 0 2020-06-28 12:18:11