How to share data among JavaRDD partitions in Spark?
I have an object that needs to be shared among partitions in Apache Spark. Below is the code snippet and the problem I'm facing.
private static void processDataWithResult() throws IOException {
    JavaRDD<Long> idRDD = createIdRDDUsingDb();
    // Captured by the anonymous function below, so Spark has to serialize it.
    final MeasureReportingData measureReportingData = getMeasureReportingData(jobConfiguration);
    resultRDD = idRDD.mapPartitions(new FlatMapFunction<Iterator<Long>, Boolean>() {
        @Override
        public Iterable<Boolean> call(Iterator<Long> idIterator) throws Exception {
            MeasureReportingData mrd = measureReportingData;
            final List<Boolean> dummyList = new ArrayList<>();
            // The first id in the partition marks the start of the batch.
            long minId = idIterator.next();
            engine.processInBatch(minId, minId + BATCH_SIZE - 1);
            return dummyList;
        }
    });
    // count() is an action; it forces the lazy mapPartitions step to run.
    resultRDD.count();
}
I want to distribute the measureReportingData object to all the partitions.
I get serialization errors because MeasureReportingData contains instance members that are not Serializable. A simulation of the issue is given in this question: How to serialize a Predicate<T> from Nashorn engine in java 8
Is there another way to share measureReportingData among partitions?
To share data between machines, the data has to be serialized at the source, transferred over the network, and deserialized at the destination. So you cannot transfer non-serializable objects.
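To see the failure in isolation, here is a minimal sketch that uses plain Java serialization and no Spark at all. EngineHandle, BrokenReportingData and SerializationCheck are hypothetical stand-ins invented for this example; the real MeasureReportingData is not shown in the question, so its fields are an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical member type that does not implement Serializable
// (for example, a handle to a script engine).
class EngineHandle {
    final String script;

    EngineHandle(String script) {
        this.script = script;
    }
}

// Hypothetical stand-in for MeasureReportingData: the class itself is
// Serializable, but one of its instance members is not.
class BrokenReportingData implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final EngineHandle handle; // not Serializable, so writing the object fails

    BrokenReportingData(String name, EngineHandle handle) {
        this.name = name;
        this.handle = handle;
    }
}

class SerializationCheck {
    // Returns true if Java serialization can write the value, and false if it
    // reaches a part of the object graph that is not Serializable.
    static boolean canSerialize(Object value) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize("plain string")); // prints true
        System.out.println(canSerialize(
                new BrokenReportingData("report", new EngineHandle("x > 1")))); // prints false
    }
}
```

Marking the class Serializable is not enough on its own: every non-transient instance member it reaches has to be serializable too.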
If MeasureReportingData is not serializable, you have to convert it into a serializable object, share that object, and then convert it back to MeasureReportingData inside the function.
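The convert, share, and convert-back approach could look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual class: ReportingData, ReportingDataSpec and ShareViaSpec are hypothetical names, the name and threshold fields are made up, and roundTrip only imitates shipping an object to another machine by serializing it to bytes and reading it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Predicate;

// Hypothetical stand-in for MeasureReportingData: not Serializable, and it
// holds a lambda-backed member that could not be serialized anyway.
class ReportingData {
    final String name;
    final long threshold;
    final Predicate<Long> filter;

    ReportingData(String name, long threshold) {
        this.name = name;
        this.threshold = threshold;
        this.filter = id -> id >= threshold;
    }
}

// Serializable description that carries only the plain values needed to
// rebuild a ReportingData at the destination.
class ReportingDataSpec implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final long threshold;

    ReportingDataSpec(String name, long threshold) {
        this.name = name;
        this.threshold = threshold;
    }

    static ReportingDataSpec from(ReportingData data) {
        return new ReportingDataSpec(data.name, data.threshold);
    }

    ReportingData rebuild() {
        return new ReportingData(name, threshold);
    }
}

class ShareViaSpec {
    // Imitates a network transfer: serialize to bytes, then deserialize a copy.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ReportingData original = new ReportingData("report", 100L);
        // Source side: convert to the serializable form and "send" it.
        ReportingDataSpec shipped = roundTrip(ReportingDataSpec.from(original));
        // Destination side: convert back before use.
        ReportingData rebuilt = shipped.rebuild();
        System.out.println(rebuilt.name + " " + rebuilt.filter.test(150L)
                + " " + rebuilt.filter.test(50L)); // prints "report true false"
    }
}
```

In the question's code, the function would capture only the serializable spec, and the rebuild step would go at the top of call(...), so the non-serializable object is created once per partition on the executor instead of being sent from the driver.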