
Rename and move S3 files based on their folder names in Spark Scala

I have Spark output in an S3 folder and I want to move all the files from that output folder to another location, but while moving I want to rename the files.

For example, I have files in S3 folders like the ones below.

[screenshot: listing of the S3 output folder]

Now I want to rename all the files and put them into another directory, with names like the ones below:

Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.1.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.2.2017-10-18-0439.Full.txt
Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.3.2017-10-18-0439.Full.txt

Here Fundamental.FinancialStatement is constant across all the files, and 2017-10-18-0439 is the current date-time.
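
For clarity, the target name breaks down as prefix.partition.group.sequence.timestamp.Full.txt. A minimal sketch of assembling such a name, where buildFileName is a hypothetical helper (not from the original post) and the timestamp pattern is inferred from the example 2017-10-18-0439:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: assembles one target file name from its parts
def buildFileName(prefix: String, partition: String, group: String, seq: Int): String = {
  // e.g. 2017-10-18-0439 -> pattern yyyy-MM-dd-HHmm
  val stamp = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmm"))
  s"$prefix.$partition.$group.$seq.$stamp.Full.txt"
}

// Produces e.g. Fundamental.FinancialStatement.FinancialStatementLineItems.Japan.1971-BAL.1.<now>.Full.txt
buildFileName("Fundamental.FinancialStatement.FinancialStatementLineItems", "Japan", "1971-BAL", 1)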

This is what I have tried so far, but I am not able to get the folder names and loop through all the files:

import org.apache.hadoop.fs._

val src  = new Path("s3://trfsmallfffile/Segments/output")
val dest = new Path("s3://trfsmallfffile/Segments/Finaloutput")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs   = src.getFileSystem(conf)

// List the entries directly under the output folder
val status = fs.listStatus(src)

status.foreach { filename =>
  val a = filename.getPath.getName
  println("file name " + a)
}

This gives me the output below:

file name DataPartition=Japan
file name DataPartition=SelfSourcedPrivate
file name DataPartition=SelfSourcedPublic
file name _SUCCESS

This gives me the folder details, not the files inside the folders.

Reference was taken from this Stack Overflow reference.

You are getting directories because you have a sub-directory level in S3. Use /*/* in the path to descend into the sub-directories.

Try this:

import org.apache.hadoop.fs._

val conf = sc.hadoopConfiguration   // assuming sc = spark context
val src  = new Path("s3://trfsmallfffile/Segments/Output")
val fs   = src.getFileSystem(conf)

// Glob one level of partition sub-directories (DataPartition=...) and the files inside them
val file = fs.globStatus(new Path("s3://trfsmallfffile/Segments/Output/*/*"))

for (urlStatus <- file) {
  // Extract the partition value, e.g. "Japan" from ".../DataPartition=Japan/part-00000.gz"
  val partitionName = urlStatus.getPath.toString.split("=")(1).split("/")(0)
  val finalPrefix   = "Fundamental.FinancialLineItem.Segments."
  val finalFileName = finalPrefix + partitionName + ".txt"
  val dest = new Path("s3://trfsmallfffile/Segments/FinalOutput/" + finalFileName)
  fs.rename(urlStatus.getPath, dest)
}
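
Note that the loop above uses only the partition name, so if a partition holds several part files they would all map to the same target name. A hedged sketch of extending it with a per-partition sequence number and the date-time stamp from the question's naming scheme (reusing fs and file from above; the prefix and timestamp pattern are assumptions taken from the question, not from the original answer):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val stamp = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HHmm"))

// Group the matched files by their DataPartition value, then number them 1..n per partition
file.groupBy(s => s.getPath.toString.split("=")(1).split("/")(0))
    .foreach { case (partition, statuses) =>
      statuses.zipWithIndex.foreach { case (status, idx) =>
        val name = s"Fundamental.FinancialLineItem.Segments.$partition.${idx + 1}.$stamp.Full.txt"
        fs.rename(status.getPath, new Path(s"s3://trfsmallfffile/Segments/FinalOutput/$name"))
      }
    }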

This has worked for me in the past:

import org.apache.hadoop.fs.{FileSystem, Path}

val path = "s3://<bucket>/<directory>"
// Resolve the S3 file system from the path's URI and the Spark Hadoop configuration
val fs = FileSystem.get(new java.net.URI(path), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(path))

listStatus provides all the files in the S3 directory.
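
Note that listStatus does not recurse into sub-directories, so with partition folders like DataPartition=... you would again only see the folders. A sketch using Hadoop's listFiles with recursive = true instead (reusing fs and path from above):

// listFiles(path, true) walks sub-directories and returns only files,
// unlike listStatus, which returns the immediate children (including directories)
val it = fs.listFiles(new Path(path), true)
while (it.hasNext) {
  println(it.next().getPath)
}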
