简体   繁体   中英

How to share data among JavaRDD partitions in Spark?

I have Some object to be shared among partitions in apache spark. Below is the code snippet and problem i'm facing.

private static void processDataWithResult() throws IOException {

        JavaRDD<Long> idRDD = createIdRDDUsingDb();
        final MeasureReportingData measureReporingData = getMeasureReportingData(jobConfiguration);

        resultRDD = idRDD.mapPartitions(new FlatMapFunction<Iterator<Long>, Boolean>() {
            @Override
            public Iterable<Boolean> call(Iterator<Long> idIterator) throws Exception {

                MeasureReportingData mrd = measureReporingData; 

                final List<Boolean> dummyList = new ArrayList<>();

                long minId = idIterator.next();

                engine.processInBatch(minId, minId + BATCH_SIZE - 1);
                return (Iterable<Boolean>) dummyList;
            }
        });

        resultRDD.count();

    }

I want to distribute measureReportingData object to all the partitions?

I get serialization errors because MeasureReportingData contains instance members that are not Serializable . Simulation of the issue is specified in this question: How to serialize a Predicate<T> from Nashorn engine in java 8

Is there another way to share measureReportingData among partitions?

In order to share data between machines, the data has to be serialized at the source, transfer over network, and de-serialized at the destination. So you cannot transfer non-serializable objects.

If MeasureReportingData is not serializable, you have to convert it into a serializable object, share that object then convert it back to MeasureReportingData inside the function.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM