I don't know how to do the same using a parquet file.

Link to (data.csv) and (output.csv)

    import org.apache.spark.sql._

    object Test {

      def main(args: Array[String]) {

        val spark = SparkSession.builder()
          .appName("Test")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext
        val tempDF = spark.read.csv("data.csv")
        tempDF.coalesce(1).write.parquet("Parquet")
        val rdd = sc.textFile("Parquet")

I converted data.csv into an optimised parquet file and loaded it. Now I want to apply all the transformations I did on the CSV file (shown below) to the parquet file, and then save the result as a parquet file.

        val header = rdd.first
        val rdd1 = rdd.filter(_ != header)
        val resultRDD = rdd1.map { r =>
          val Array(country, values) = r.split(",")
          country -> values
        }.reduceByKey((a, b) =>
          a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))

        import spark.implicits._
        val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
        dataSet.coalesce(1).write.option("header", "true").csv("output")
      }

      case class CountryAgg(country: String, values: String)
    }

I reckon you are trying to add up the corresponding elements from the array, grouped by Country. I have done this using the DataFrame API, which makes the job easier.

Code for your reference:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .option("path", "/path/to/input/data.csv")
              .load()


val df1 = df.select(
                $"Country", 
                (split($"Values", ";"))(0).alias("c1"),
                (split($"Values", ";"))(1).alias("c2"),
                (split($"Values", ";"))(2).alias("c3"),
                (split($"Values", ";"))(3).alias("c4"),
                (split($"Values", ";"))(4).alias("c5")
             )
             .groupBy($"Country")
             .agg(
             sum($"c1" cast "int").alias("s1"),
             sum($"c2" cast "int").alias("s2"),
             sum($"c3" cast "int").alias("s3"),
             sum($"c4" cast "int").alias("s4"),
             sum($"c5" cast "int").alias("s5")             
             )
             .select(
                $"Country", 
                concat(
                    $"s1", lit(";"), 
                    $"s2", lit(";"), 
                    $"s3", lit(";"), 
                    $"s4", lit(";"), 
                    $"s5"
                ).alias("Values")
            )

df1.repartition(1)
    .write
    .format("csv")
    .option("delimiter",",")
    .option("header", "true")
    .option("path", "/path/to/output")
    .save()

Here is the output for your reference.

scala> df1.show()
+-------+-------------------+
|Country|             Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
|  China| 218;239;234;209;75|
|  India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+

PS:

  1. You can change the output format to parquet/orc or anything you wish; see the sketch after this list.

  2. I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose whether to repartition based on your use case.
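
For example, a minimal sketch of point 1: writing df1 as Parquet instead of CSV (the output path here is illustrative; Parquet stores its own schema, so the header and delimiter options are not needed):

// Same writer as above, with only the format and path changed
df1.repartition(1)
    .write
    .format("parquet")
    .option("path", "/path/to/parquet-output")
    .save()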

Hope this helps.

You could just read the file as parquet and perform the same operations on the resulting DataFrame:

val spark = SparkSession.builder()
    .appName("Test")
    .master("local[*]")
    .getOrCreate()

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")

If you need an RDD, you can then just call:

val rdd = parquetFileDF.rdd

Then you can proceed with the transformations as before and write the result as parquet, as in your question.
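
For completeness, here is a minimal sketch of that, assuming the parquet file holds only data rows with two string columns (country at position 0, values at position 1, i.e. the CSV was read with the header option before being written as parquet); the column positions and output path are assumptions:

// Rows from a parquet-backed DataFrame are org.apache.spark.sql.Row
// objects, not raw text, so there is no header line to strip and no
// comma-splitting to do.
val resultRDD = parquetFileDF.rdd
  .map(row => row.getString(0) -> row.getString(1))
  .reduceByKey { (a, b) =>
    // same element-wise sum as the CSV version
    a.split(";").zip(b.split(";"))
      .map { case (x, y) => x.toInt + y.toInt }
      .mkString(";")
  }

import spark.implicits._
resultRDD.toDF("country", "values")
  .coalesce(1)
  .write
  .parquet("output.parquet")   // output path is illustrative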
