I don't know how to do the same using a parquet file.

Link to (data.csv) and (output.csv)

    import org.apache.spark.sql._

    object Test {

      def main(args: Array[String]) {

        val spark = SparkSession.builder()
          .appName("Test")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext
        val tempDF = spark.read.csv("data.csv")
        tempDF.coalesce(1).write.parquet("Parquet")
        val rdd = sc.textFile("Parquet")

I converted data.csv into an optimised parquet file and loaded it. Now I want to apply all the transformations I did on the CSV file (shown below) to the parquet file, and then save the result as a parquet file.

        val header = rdd.first
        val rdd1 = rdd.filter(_ != header)
        val resultRDD = rdd1.map { r =>
          val Array(country, values) = r.split(",")
          country -> values
        }.reduceByKey((a, b) =>
          a.split(";").zip(b.split(";")).map { case (i1, i2) => i1.toInt + i2.toInt }.mkString(";"))

        import spark.implicits._
        val dataSet = resultRDD.map { case (country: String, values: String) => CountryAgg(country, values) }.toDS()
        dataSet.coalesce(1).write.option("header", "true").csv("output")
      }

      case class CountryAgg(country: String, values: String)
    }

I reckon you are trying to add up the corresponding elements from the array, grouped by Country. I have done this using the DataFrame API, which makes the job easier.

Code for your reference:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = spark.read
              .format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .option("path", "/path/to/input/data.csv")
              .load()


val df1 = df.select(
                $"Country", 
                (split($"Values", ";"))(0).alias("c1"),
                (split($"Values", ";"))(1).alias("c2"),
                (split($"Values", ";"))(2).alias("c3"),
                (split($"Values", ";"))(3).alias("c4"),
                (split($"Values", ";"))(4).alias("c5")
             )
             .groupBy($"Country")
             .agg(
             sum($"c1" cast "int").alias("s1"),
             sum($"c2" cast "int").alias("s2"),
             sum($"c3" cast "int").alias("s3"),
             sum($"c4" cast "int").alias("s4"),
             sum($"c5" cast "int").alias("s5")             
             )
             .select(
                $"Country", 
                concat(
                    $"s1", lit(";"), 
                    $"s2", lit(";"), 
                    $"s3", lit(";"), 
                    $"s4", lit(";"), 
                    $"s5"
                ).alias("Values")
            )

df1.repartition(1)
    .write
    .format("csv")
    .option("delimiter",",")
    .option("header", "true")
    .option("path", "/path/to/output")
    .save()

Here is the output for your reference.

scala> df1.show()
+-------+-------------------+
|Country|             Values|
+-------+-------------------+
|Germany| 144;166;151;172;70|
|  China| 218;239;234;209;75|
|  India| 246;153;148;100;90|
| Canada| 183;258;150;263;71|
|England|178;114;175;173;153|
+-------+-------------------+

PS:

  1. You can change the output format to parquet/orc or anything you wish; see the sketch after this list.

  2. I have repartitioned df1 into 1 partition just so that you get a single output file. You can choose whether to repartition based on your use case.
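
For example, a minimal sketch of point 1: writing df1 as Parquet instead of CSV (the output path here is illustrative; Parquet stores its own schema, so the header and delimiter options are not needed):

// Same writer as above, with only the format and path changed
df1.repartition(1)
    .write
    .format("parquet")
    .option("path", "/path/to/parquet-output")
    .save()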

Hope this helps.

You could just read the file as parquet and perform the same operations on the resulting DataFrame:

val spark = SparkSession.builder()
    .appName("Test")
    .master("local[*]")
    .getOrCreate()

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("data.parquet")

If you need an RDD, you can then just call:

val rdd = parquetFileDF.rdd

Then you can proceed with the transformations as before and write the result as parquet, as in your question.
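
For completeness, here is a minimal sketch of that, assuming the parquet file holds only data rows with two string columns (country at position 0, values at position 1, i.e. the CSV was read with the header option before being written as parquet); the column positions and output path are assumptions:

// Rows from a parquet-backed DataFrame are org.apache.spark.sql.Row
// objects, not raw text, so there is no header line to strip and no
// comma-splitting to do.
val resultRDD = parquetFileDF.rdd
  .map(row => row.getString(0) -> row.getString(1))
  .reduceByKey { (a, b) =>
    // same element-wise sum as the CSV version
    a.split(";").zip(b.split(";"))
      .map { case (x, y) => x.toInt + y.toInt }
      .mkString(";")
  }

import spark.implicits._
resultRDD.toDF("country", "values")
  .coalesce(1)
  .write
  .parquet("output.parquet")   // output path is illustrative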
