Spark Dataset - read CSV and write empty output
I have an input file test-reading.csv:
id,sku,price
"100002701--425370728",100002701,12159
"100002701--510892030",100002701,11021
"100002701-235195215",100002701,12330
"100002701-110442364",100002701,9901
"100002701-1963746094",100002701,11243
I wrote the following source code in order to have a minimal, complete, and verifiable example of the problem I'm facing.
There is a ReadingRecord class used to read the CSV file and a WritingRecord class used to write the output. Incidentally, they are now almost identical, but in the real program they were quite different because they represent the input and output structures.
The remaining code starts Spark, reads the CSV, maps ReadingRecord to WritingRecord, and writes an output CSV.
The question is: why does this Spark program stop writing the CSV output if I uncomment the for loop in the flatMapGroups method?
import java.util.ArrayList

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class ReadingRecord(var id: String, var sku: Integer, var price: Integer) {
  def toWritingRecord(): WritingRecord = {
    WritingRecord(this.id, this.sku, this.price)
  }
}

case class WritingRecord(var id: String, var sku: Integer, var price: Integer)

object ReadingRecordEncoders {
  implicit def ReadingRecordEncoder: org.apache.spark.sql.Encoder[ReadingRecord] =
    org.apache.spark.sql.Encoders.kryo[ReadingRecord]
}
object WritingTest {

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[8]")
      .setAppName("writing-test")
      .set("spark.executor.memory", "1gb")
      .set("spark.num.executors", "8")
      .set("spark.executor.heartbeatInterval", "120")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._
    import ReadingRecordEncoders._

    val data = spark.read.option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .csv("test-reading.csv")
      .map(r => {
        println(r)
        ReadingRecord(r(0).asInstanceOf[String], r(1).asInstanceOf[Integer], r(2).asInstanceOf[Integer])
      })
      .groupByKey(r1 => r1.sku)

    val data1 = data.flatMapGroups((a: Integer, b: Iterator[ReadingRecord]) => {
      var list = new ArrayList[ReadingRecord]
      try {
        // for (o <- b) {
        //   list.add(o)
        // }
      } finally {
        list.clear()
        list = null
      }
      b.map(f => f.toWritingRecord)
    })

    data1.printSchema()
    data1.write
      .format("csv")
      .option("header", "true")
      .save("output.csv")
  }
}
With the commented-out code included, you are trying to reuse the Iterator b. An Iterator is modified when it is used:
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it.
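The failure can be reproduced without Spark at all; a minimal sketch in plain Scala (object name is illustrative):

```scala
object IteratorExhaustion extends App {
  val b = Iterator(1, 2, 3)

  // This mimics the commented-out for loop: it fully consumes the iterator.
  for (o <- b) { /* inspect o */ }

  // The iterator is now empty, so mapping over it yields no elements --
  // just as flatMapGroups produces no rows and the output CSV stays empty.
  val mapped = b.map(_ * 2).toList
  println(mapped)  // List()
}
```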
See the Iterator documentation.
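One way to fix the program, assuming the extra pass over the group is really needed, is to materialize each group into a collection once and derive both traversals from that collection. A sketch (the helper `processGroup` is hypothetical, not part of the original code):

```scala
// Sketch of a corrected flatMapGroups body: buffer the Iterator exactly once,
// then both the inspection pass and the output rows read from the buffer.
def processGroup[A, B](sku: Integer, b: Iterator[A])(convert: A => B): Iterator[B] = {
  val group = b.toList          // the single traversal of the Iterator
  group.foreach(o => ())        // safe extra pass: iterates the List, not the Iterator
  group.iterator.map(convert)   // rows for the output Dataset
}
```

In the question's code this would replace the body of the lambda passed to flatMapGroups, with `convert = _.toWritingRecord`.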