
Spark Dataset - read CSV and write empty output

I have an input file test-reading.csv:

id,sku,price
"100002701--425370728",100002701,12159
"100002701--510892030",100002701,11021
"100002701-235195215",100002701,12330
"100002701-110442364",100002701,9901
"100002701-1963746094",100002701,11243

I wrote the following source code in order to have a minimal, complete, and verifiable example of the problem I'm facing.

There is a ReadingRecord class used to read the CSV file and a WritingRecord class used to write the output. Incidentally, they are now almost identical, but in the real program they are quite different because they represent the input and output structures.

The remaining code starts Spark, reads the CSV, maps each ReadingRecord to a WritingRecord, and writes an output CSV.

The question is: why does this Spark program stop writing the CSV output if I uncomment the for loop inside the flatMapGroups method?

import java.util.ArrayList

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class ReadingRecord(var id: String, var sku: Integer, var price: Integer) {
  def toWritingRecord(): WritingRecord = {
    new WritingRecord(this.id, this.sku, this.price)
  }
}

case class WritingRecord(var id: String, var sku: Integer, var price: Integer)

object ReadingRecordEncoders {
  implicit def ReadingRecordEncoder: org.apache.spark.sql.Encoder[ReadingRecord] =
    org.apache.spark.sql.Encoders.kryo[ReadingRecord]
}

object WritingTest {

  def main(args: Array[String]) {

    val conf = new SparkConf()
      .setMaster("local[8]")
      .setAppName("writing-test")
      .set("spark.executor.memory", "1gb")
      .set("spark.num.executors", "8")
      .set("spark.executor.heartbeatInterval", "120")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    import spark.implicits._
    import ReadingRecordEncoders._

    // Read the CSV (header, inferred schema), map each Row to a ReadingRecord,
    // then group the records by sku
    val data = spark.read.option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .csv("test-reading.csv")
      .map(r => {
        println(r)
        new ReadingRecord(r(0).asInstanceOf[String], r(1).asInstanceOf[Integer], r(2).asInstanceOf[Integer])
      }).groupByKey(r1 => r1.sku)

    // flatMapGroups hands each group over as a one-pass Iterator[ReadingRecord]
    val data1 = data.flatMapGroups((a: Integer, b: Iterator[ReadingRecord]) => {
      var list = new ArrayList[ReadingRecord]
      try {
        // uncommenting this loop makes the program write an empty output:
        //        for (o <- b) {
        //          list.add(o)
        //        }
      } finally {
        list.clear()
        list = null
      }

      b.map(f => f.toWritingRecord)
    })

    data1.printSchema()

    // Spark writes a directory named "output.csv" containing the part files
    data1.write
      .format("csv")
      .option("header", "true")
      .save("output.csv")
  }
}

With the commented-out code included, you are trying to reuse the Iterator b. An Iterator is modified when it is used:

It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it.

See the Iterator documentation.
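
In other words, once the uncommented loop has consumed b, the b.map(f => f.toWritingRecord) call that follows sees an already-exhausted iterator, yields zero rows, and Spark writes an empty output. The behaviour is easy to reproduce with plain Scala collections:

val it = Iterator(1, 2, 3)
println(it.toList)   // List(1, 2, 3): this consumes the iterator
println(it.toList)   // List(): a second traversal finds nothing

A minimal sketch of one possible fix (the names sku and records below are my own, not from the original code): materialize each group into a strict collection exactly once, so both the loop and the final mapping can work on the buffered copy.

val data1 = data.flatMapGroups((sku: Integer, b: Iterator[ReadingRecord]) => {
  // consume the one-pass iterator exactly once
  val records = b.toVector
  for (o <- records) {
    // inspect or accumulate here; the buffered copy can be traversed again
  }
  // the buffered collection is still intact, so this emits every record
  records.map(_.toWritingRecord())
})

Buffering costs memory proportional to the group size, so for very large groups it may be better to restructure the logic so the iterator is traversed only once.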


 