Scala: How to merge multiple CSV files in a data frame
I am writing the code below to load a CSV file into an RDD. I want to union multiple CSV files and store the result in a single RDD variable. I am able to store the data of one CSV file in an RDD; kindly help me with how to union multiple CSV files and store them in a single RDD variable.
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(","))
I am expecting something like
val Rdd = spark.sparkContext.textFile("File1.csv").map(_.split(",")) union spark.sparkContext.textFile("File2.csv").map(_.split(","))
If you have a large number of files, I would suggest:
val rdd = List("file1", "file2", "file3", "file4", "file5")
.map(spark.sparkContext.textFile(_))
.reduce(_ union _)
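The reduce pattern above can be sketched with plain Scala lists standing in for per-file RDDs (the file contents below are hypothetical, and `List.union` concatenates just as `RDD.union` does), which shows why `reduce(_ union _)` merges every file into one collection:

```scala
// A minimal sketch of the reduce-union pattern, using plain Scala Lists
// as stand-ins for per-file RDDs (RDD.union likewise concatenates).
object ReduceUnionSketch {
  // Hypothetical per-file contents: each inner List models textFile("fileN").
  val perFileLines: List[List[String]] = List(
    List("a,1", "b,2"), // pretend contents of file1
    List("c,3"),        // pretend contents of file2
    List("d,4", "e,5")  // pretend contents of file3
  )

  // Same shape as .map(spark.sparkContext.textFile(_)).reduce(_ union _):
  val merged: List[String] = perFileLines.reduce(_ union _)

  def main(args: Array[String]): Unit =
    println(merged.mkString(";"))
}
```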
Or if you only know you have zero or more files:
val rdd = getListOfFilenames()
.map(spark.sparkContext.textFile(_))
.foldLeft(spark.sparkContext.emptyRDD[String])(_ union _)
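The practical difference between the two versions is that `reduce` throws on an empty collection, while `foldLeft` starts from the empty RDD and so tolerates an empty file list. A sketch with plain Scala lists standing in for RDDs (hypothetical data) illustrates this:

```scala
// Sketch: why foldLeft with an empty seed is safe for zero files,
// using Lists as stand-ins for RDDs (hypothetical contents).
object FoldLeftUnionSketch {
  def mergeAll(perFile: List[List[String]]): List[String] =
    // Mirrors .foldLeft(emptyRDD[String])(_ union _): the empty seed
    // means an empty file list simply yields an empty result.
    perFile.foldLeft(List.empty[String])(_ union _)

  val some = mergeAll(List(List("a,1"), List("b,2")))
  val none = mergeAll(Nil) // reduce(_ union _) would throw here

  def main(args: Array[String]): Unit = {
    println(some.mkString(";"))
    println(none.length)
  }
}
```

Note also that the resulting RDD holds raw lines (`RDD[String]`); to match the question's original code, append `.map(_.split(","))` after the union. And for a fixed, known set of files, `SparkContext.textFile` also accepts a comma-separated list of paths (and glob patterns), e.g. `textFile("File1.csv,File2.csv")`, which reads them all into one RDD in a single call.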