How to list files in S3 bucket using Spark Session?

Is it possible to list all of the files in a given S3 path (e.g. s3://my-bucket/my-folder/*.extension) using a SparkSession object?

You can use the Hadoop API to access files on S3 (Spark uses it as well):

import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration

val path = "s3://somebucket/somefolder"
val fileSystem = FileSystem.get(URI.create(path), new Configuration())
val it = fileSystem.listFiles(new Path(path), true)
while (it.hasNext()) {
  // each LocatedFileStatus returned by it.next() describes one file found recursively under the path
  println(it.next().getPath())
}

Lou Zell was very close! The below ended up working on ADLS2, but I'm putting it here because of the Py4J magic. Note that the NoopCache causes the listing job to be run twice: once when the index is created and once when listLeafFiles is called. Might write a blog post on this:

import os

base_path = "/mnt/my_data/"
glob_pattern = "*"
sc = spark.sparkContext
hadoop_base_path = sc._jvm.org.apache.hadoop.fs.Path(base_path)
paths = sc._jvm.PythonUtils.toSeq([hadoop_base_path])

# Grab the NoopCache Scala object via reflection so the file index does not cache listing results
noop_cache_clazz = sc._jvm.java.lang.Class.forName("org.apache.spark.sql.execution.datasources.NoopCache$")
ff = noop_cache_clazz.getDeclaredField("MODULE$")
noop_cache = ff.get(None)

# Build an InMemoryFileIndex directly through Py4J
in_memory_file_index = sc._jvm.org.apache.spark.sql.execution.datasources.InMemoryFileIndex(
    spark._jsparkSession,
    paths,
    sc._jvm.PythonUtils.toScalaMap({}),
    sc._jvm.scala.Option.empty(),
    noop_cache,
    sc._jvm.scala.Option.empty(),
    sc._jvm.scala.Option.empty()
)
glob_path = sc._jvm.org.apache.hadoop.fs.Path(os.path.join(base_path, glob_pattern))
glob_paths = sc._jvm.PythonUtils.toSeq([glob_path])
# SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
status_list = in_memory_file_index.listLeafFiles(glob_paths)
path_list = []
# listLeafFiles returns a Scala Seq of FileStatus objects; walk it with its iterator
it = status_list.iterator()
while it.hasNext():
    path_status = it.next()
    path_list.append(str(path_status.getPath().toUri().getRawPath()))

path_list.sort()

print(path_list)
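Note that base_path above points at a Databricks-style mount (/mnt/...). Presumably the same incantation works against S3 or ADLS2 directly by swapping in a fully qualified URI, e.g. (bucket and container names below are placeholders):

base_path = "s3a://my-bucket/my-folder/"  # S3
# base_path = "abfss://container@account.dfs.core.windows.net/my-folder/"  # ADLS2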

You can use input_file_name with a DataFrame; it will give you the absolute file path per row.

The following code will give you all the file paths.

spark.read.table("zen.intent_master").select(input_file_name).distinct.collect

I am assuming that for your use case you just want to read data from a set of files matching some regex, so you can apply that in a filter.

For example,

val df = spark.read.table("zen.intent_master").filter(input_file_name().rlike("your regex string"))
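For pyspark users, a rough equivalent of the two snippets above would be (assuming the same zen.intent_master table):

from pyspark.sql.functions import input_file_name

# Distinct source file per row of the table
paths = [r[0] for r in spark.read.table("zen.intent_master")
                            .select(input_file_name())
                            .distinct()
                            .collect()]

# Or keep only rows that come from files matching a regex
df = spark.read.table("zen.intent_master").filter(input_file_name().rlike("your regex string"))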

Approach 1

For pyspark users, I've translated Michael Spector's answer (I'll leave it to you to decide if using this is a good idea):

sc = spark.sparkContext
myPath = f's3://my-bucket/my-prefix/'
javaPath = sc._jvm.java.net.URI.create(myPath)
hadoopPath = sc._jvm.org.apache.hadoop.fs.Path(myPath)
hadoopFileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem.get(javaPath, sc._jvm.org.apache.hadoop.conf.Configuration())
iterator = hadoopFileSystem.listFiles(hadoopPath, True)  # True = list recursively

s3_keys = []
while iterator.hasNext():
    s3_keys.append(iterator.next().getPath().toUri().getRawPath())    

s3_keys now holds all file keys found at my-bucket/my-prefix.
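If you only need the *.extension files from the question, a plain Python filter over the collected keys should be enough (the extension here is just a placeholder):

extension_keys = [k for k in s3_keys if k.endswith(".extension")]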

Approach 2

Here is an alternative that I found (hat tip to @forgetso):

myPath = 's3://my-bucket/my-prefix/*'
hadoopPath = sc._jvm.org.apache.hadoop.fs.Path(myPath)
hadoopFs = hadoopPath.getFileSystem(sc._jvm.org.apache.hadoop.conf.Configuration())
statuses = hadoopFs.globStatus(hadoopPath)

for status in statuses:
  status.getPath().toUri().getRawPath()
  # Alternatively, you can get file names only with:
  # status.getPath().getName()
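globStatus understands the same glob syntax as the path in the question, so to match only one extension you could presumably point it straight at that pattern and reuse the snippet above:

myPath = 's3://my-bucket/my-folder/*.extension'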

Approach 3 (incomplete!)

The two approaches above do not use the Spark parallelism mechanism that would be applied on a distributed read. That logic looks private though. See parallelListLeafFiles here. I have not found a way to compel pyspark to do a distributed ls on S3 without also reading the file contents. I tried to use Py4J to instantiate an InMemoryFileIndex, but can't get the incantation right. Here is what I have so far if someone wants to pick it up from here:

myPath = f's3://my-bucket/my-path/'
paths = sc._gateway.new_array(sc._jvm.org.apache.hadoop.fs.Path, 1)
paths[0] = sc._jvm.org.apache.hadoop.fs.Path(myPath)

emptyHashMap = sc._jvm.java.util.HashMap()
emptyScalaMap = sc._jvm.scala.collection.JavaConversions.mapAsScalaMap(emptyHashMap)

# Py4J is not happy with this:
sc._jvm.org.apache.spark.sql.execution.datasources.InMemoryFileIndex(
    spark._jsparkSession, 
    paths, 
    emptyScalaMap, 
    sc._jvm.scala.Option.empty() # Optional None
)
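Until someone gets the InMemoryFileIndex incantation right, one possible workaround (a sketch, not the Spark-internal mechanism above; it assumes boto3 is available on the executors, and my-bucket and the sub-prefixes are placeholders) is to parallelize plain S3 listing calls across the cluster yourself:

import boto3  # assumed to be installed on the executors

def list_keys(prefixes):
    # One S3 client per partition; paginate through every object under each prefix
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for prefix in prefixes:
        for page in paginator.paginate(Bucket="my-bucket", Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

# Hypothetical sub-prefixes used only to split the listing work across tasks
prefixes = ["my-prefix/2021/", "my-prefix/2022/", "my-prefix/2023/"]
all_keys = sc.parallelize(prefixes, len(prefixes)).mapPartitions(list_keys).collect()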
