
Split a large file into small files and save in different paths using Spark

How do I split a large file/RDD/DataFrame into small files and save them to different paths?

For example: a text file contains usernames (a single column), and I want to split it into N files and write those N files to different directories.

val x = 20                                // target number of names per file
val namesRDD = sc.textFile("readPath")    // one username per line
val N = namesRDD.count / x                // number of output files

How do I split the namesRDD into N files and write them to some "savepath/N/", i.e. the first file written to "savepath/1/", the second to "savepath/2/", and so on?
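One direct way to get exactly that layout is to split the RDD into N pieces and save each piece under its own numbered directory. A minimal sketch, assuming the sc, namesRDD, and N from above (note that randomSplit produces roughly equal pieces, not exactly x names each):

// Split into N roughly-equal random pieces and write each to its own directory.
val pieces = namesRDD.randomSplit(Array.fill(N.toInt)(1.0))
pieces.zipWithIndex.foreach { case (piece, i) =>
  piece.saveAsTextFile(s"savepath/${i + 1}/")  // savepath/1/, savepath/2/, ...
}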

Using repartitionByRange will let you split your data this way.

Example:

df.repartitionByRange($"region").write.csv("data/regions")

This will create one part file for every region that appears in your data. If you have 10 regions, you will have 10 different part-files.

If you want to specify your own file names, you will have to apply your own save function with foreachPartition; a fuller sketch follows the snippet below.

df.repartitionByRange($"region")
  .foreachPartition { (rows: Iterator[Row]) =>
    // custom save logic for this partition's rows
  }
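A minimal sketch of such a custom save, assuming HDFS-compatible storage; the savepath layout and the usernames.txt file name here are hypothetical, and the first column is assumed to be a string:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.sql.Row

df.repartitionByRange($"region")
  .foreachPartition { (rows: Iterator[Row]) =>
    val partId = TaskContext.getPartitionId()      // 0-based index of this partition
    val fs = FileSystem.get(new Configuration())   // executor-side handle to the default filesystem
    // Hypothetical layout: one file per partition under a numbered directory.
    val out = fs.create(new Path(s"savepath/${partId + 1}/usernames.txt"))
    try rows.foreach(row => out.writeBytes(row.getString(0) + "\n"))
    finally out.close()
  }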

To split the file/DataFrame into N parts randomly (when there is no column suitable for repartitionByRange), use repartition:

df.repartition(N)            // shuffle rows into N partitions
  .write.text(storePath)     // one part file per partition under storePath
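Applied to the username example from the question, a minimal sketch (assuming a SparkSession named spark; spark.read.textFile yields a single-column Dataset[String], which is what write.text expects):

val x = 20
val names = spark.read.textFile("readPath")   // one username per line
val N = (names.count / x).toInt               // number of output files
names.repartition(N)
  .write.text("savepath")                     // writes up to N part files under savepath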

Then read those partitions back (and do whatever is needed on each partition's DataFrame):

for (i <- 0 until N) {
  val part = f"$i%04d"                           // zero-padded index, e.g. 0000
  val splitPath = s"$storePath/part-0$part-*"    // matches Spark's part-0000i-* file names
  val split = spark.read.text(splitPath)         // read just this split
  // do whatever is needed with `split`
}
