
Create a DF after registering a previous DF in Spark Scala

I am a new Spark Scala developer and I want to ask you about my problem.

I have two huge dataframes; the second one is computed from the first (it contains the distinct values of a column from the first one).

To optimize my code, I thought about this approach:

  • Save my first dataframe as a .csv file in HDFS
  • Then simply read this .csv file back to compute the second dataframe.

So, I wrote this:

// temp1 is my first DF
writeAsTextFileAndMerge("result1.csv", "/user/result", temp1, spark.sparkContext.hadoopConfiguration)

val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("/user/result/result1.csv").select("ID").distinct

writeAsTextFileAndMerge("result2.csv", "/user/result",
  temp2, spark.sparkContext.hadoopConfiguration)

And this is my save function:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.DataFrame

// Writes the DataFrame as CSV part files to a temporary location,
// then merges them into a single file under outputPath.
def writeAsTextFileAndMerge(fileName: String, outputPath: String, df: DataFrame, conf: Configuration): Unit = {
  val sourceFile = WorkingDirectory // temporary output path, defined elsewhere in my code
  df.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv(sourceFile)
  merge(fileName, sourceFile, outputPath, conf)
}

// Merges all part files under srcPath into the single file dstPath/fileName
// (deleting the source directory afterwards).
def merge(fileName: String, srcPath: String, dstPath: String, conf: Configuration): Unit = {
  val hdfs = FileSystem.get(conf)
  val destinationPath = new Path(dstPath)
  if (!hdfs.exists(destinationPath))
    hdfs.mkdirs(destinationPath)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName),
    true, conf, null)
}

It seems "logical" to me but I got errors doing this. I guess it's not possible for Spark to "wait" until registering my first DF in HDFS and AFTER read this new file (or maybe I have some errors on my save function ?).

Here is the exception that I got:

19/02/16 17:27:56 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 1
java.lang.ArrayIndexOutOfBoundsException: 1

Can you help me to fix this, please?

The problem is the merge - Spark is not aware of, and thus not synchronized with, the HDFS operations you are doing behind its back.

The good news is that you don't need to do that. Just do df.write and then create a new dataframe with the read (Spark will read all the part files into a single DF).

i.e. the following would work just fine:

temp1.write.options(Map("header" -> "true", "delimiter" -> ";")).mode("overwrite").csv("/user/result/result1.csv")

val temp2 = spark.read.options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("/user/result/result1.csv").select("ID").distinct
