I am a new developer at Spark Scala and I want to ask you about my problem.
I have two huge dataframes, my second dataframe is computed from the first dataframe (it contains a distinct column from the first one).
To optimize my code, I thought about this approach :
So, it wrote this :
//val temp1 is my first DF
writeAsTextFileAndMerge("result1.csv", "/user/result", temp1, spark.sparkContext.hadoopConfiguration)
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
.csv("/user/result/result1.csv").select("ID").distinct
writeAsTextFileAndMerge("result2.csv", "/user/result",
temp2, spark.sparkContext.hadoopConfiguration)
And this is my save function :
def writeAsTextFileAndMerge(fileName: String, outputPath: String, df: DataFrame, conf: Configuration) {
val sourceFile = WorkingDirectory
df.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv(sourceFile)
merge(fileName, sourceFile, outputPath, conf)
}
def merge(fileName: String, srcPath: String, dstPath: String, conf: Configuration) {
val hdfs = FileSystem.get(conf)
val destinationPath = new Path(dstPath)
if (!hdfs.exists(destinationPath))
hdfs.mkdirs(destinationPath)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName),
true, conf, null)
}
It seems "logical" to me but I got errors doing this. I guess it's not possible for Spark to "wait" until registering my first DF in HDFS and AFTER read this new file (or maybe I have some errors on my save function ?).
Here is the exception that I got :
19/02/16 17:27:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 1
java.lang.ArrayIndexOutOfBoundsException: 1
Can you help me to fix this please ?
The problem is the merge - Spark is not aware and thus not synchronized with all the HDFS operations you are making.
The good news is that you don't need to do that. just do df.write and then create a new dataframe with the read (spark will read all the parts into a single df)
ie the following would work just fine
temp1.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv("/user/result/result1.csv")
val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
.csv("/user/result/result1.csv").select("ID").distinct
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.