Writing a custom class to HDFS in Apache Flink

Question

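Having started out with Spark, I am trying to get familiar with Flink's semantics. I would like to write a DataSet[IndexNode] to persistent storage in HDFS so that another process can read it later. Spark has a simple ObjectFile API that provides this functionality, but I cannot find a similar option in Flink.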

case class IndexNode(vec: Vector[IndexNode],
                     id: Int) extends Serializable {
  // Getters and setters etc. here
}

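The built-in sinks tend to serialize instances via their toString method, which is not suitable here because of the nested structure of the class. I imagine the solution is to use a FileOutputFormat and convert the instances to a byte stream. However, I am not sure how to serialize the vector, which can have arbitrary length and can be nested many levels deep.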

Answer 1

You can achieve this with SerializedOutputFormat and SerializedInputFormat.

Try the following steps:

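Make IndexNode implement Flink's IOReadableWritable interface. Mark fields that should not be serialized as @transient. Implement the write(DataOutputView out) and read(DataInputView in) methods. The write method writes out all the data of an IndexNode, and the read method reads it back and rebuilds all internal fields. For example, instead of serializing the whole arr field of the Result class, I write out only the values it depends on, then read them back and rebuild the array in the read method.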
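```
import org.apache.flink.core.io.IOReadableWritable
import org.apache.flink.core.memory.{DataInputView, DataOutputView}

class Result(var name: String, var count: Int) extends IOReadableWritable {

  // Not serialized directly; rebuilt from count in read().
  @transient
  var arr = Array(count, count)

  // No-argument constructor, used when instances are created before read() is called.
  def this() = this("", 1)

  override def write(out: DataOutputView): Unit = {
    out.writeInt(count)
    out.writeUTF(name)
  }

  // Fields must be read in the same order they were written.
  override def read(in: DataInputView): Unit = {
    count = in.readInt()
    name = in.readUTF()
    arr = Array(count, count)
  }

  override def toString: String = s"$name, $count, ${arr.toList}"
}
```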
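The same write/read idea can be carried over to the recursive IndexNode from the question. The following is only a minimal sketch, not Flink API: IndexNodeCodec and roundTrip are hypothetical names, and the encoding (id, then child count, then each child subtree in order) is one assumed layout among many. It uses only java.io so that it runs without Flink on the classpath; since Flink's DataOutputView and DataInputView extend java.io.DataOutput and java.io.DataInput, the same two helpers could be called from write and read of an IOReadableWritable wrapper.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

// Same shape as the IndexNode in the question.
case class IndexNode(vec: Vector[IndexNode], id: Int)

// Hypothetical helper for illustration; not part of Flink.
object IndexNodeCodec {

  // Pre-order encoding: id, number of children, then each child subtree.
  def write(node: IndexNode, out: DataOutput): Unit = {
    out.writeInt(node.id)
    out.writeInt(node.vec.length)
    node.vec.foreach(child => write(child, out))
  }

  // Mirrors write(): the child count tells us how many subtrees follow.
  def read(in: DataInput): IndexNode = {
    val id = in.readInt()
    val childCount = in.readInt()
    val children = Vector.fill(childCount)(read(in))
    IndexNode(children, id)
  }

  // Serializes to an in-memory buffer and decodes it again.
  def roundTrip(node: IndexNode): IndexNode = {
    val buffer = new ByteArrayOutputStream()
    write(node, new DataOutputStream(buffer))
    read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))
  }
}
```

Writing the child count before the children is what lets the reader handle vectors of arbitrary length, and the recursion handles arbitrary depth. Both helpers recurse once per nesting level, so an extremely deep tree could exhaust the stack. Note also that IndexNode as declared is an immutable case class, so using it with SerializedOutputFormat would presumably need a mutable wrapper with a no-argument constructor, as Result has.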

Write the data with

 myDataSet.write(new SerializedOutputFormat[Result], "/tmp/test")

and read it back with

 env.readFile(new SerializedInputFormat[Result], "/tmp/test")