
Apply dataset as Broadcast in Spark

I have two datasets, and I need to register the smaller one as a broadcast variable. When I try to register it, I am unable to use the broadcast functions.

Here is the code:

JavaRDD<String> maps = ctx.textFile("C:\\Users\\sateesh\\Desktop\\country.txt");
Broadcast<JavaRDD<String>> broadcastVar = ctx.broadcast(maps);
//Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);
List<Integer> list = new ArrayList<Integer>();
list.add(1);
list.add(2);
list.add(9);
JavaRDD<Integer> listrdd = ctx.parallelize(list);
JavaRDD<Object> mapr = listrdd.map(x -> broadcastVar.value());
System.out.println(mapr.collect());

Here I am not able to call broadcastVar.value().get(x). If I register a manually built map as a broadcast, it works well, but with a text file it does not work.

To broadcast data to the cluster, the data has to come from the driver. An RDD is only a handle to distributed data, so broadcasting the RDD itself does not ship its contents. Instead, collect() your RDD and broadcast the result.

JavaRDD<String> rdd = ctx.textFile("C:\\Users\\sateesh\\Desktop\\country.txt");

Broadcast<List<String>> broadcastVar = ctx.broadcast(rdd.collect());

Please be aware that collect() brings the entire RDD to the driver, which may cause an out-of-memory (OOM) error there. Broadcasting is recommended only for small datasets.
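
The question wanted to call broadcastVar.value().get(x) with an integer key, which needs a Map rather than the List<String> that collect() returns. Below is a minimal sketch of building such a map on the driver before broadcasting it. It assumes each line of country.txt has the form `id,name`; the question does not show the file format, so adjust the parsing to match. The class name CountryLookup and the method parse are hypothetical names introduced only for this illustration, and the Spark calls are left as comments so the helper runs without a cluster.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns collected text lines into an id -> name map.
class CountryLookup {

    // Assumes lines look like "1,India"; blank or malformed lines are skipped.
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> byId = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            if (parts.length < 2) {
                continue; // no comma: not an "id,name" line
            }
            try {
                byId.put(Integer.parseInt(parts[0].trim()), parts[1].trim());
            } catch (NumberFormatException e) {
                // id is not an integer: skip the line
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        // Stand-in for rdd.collect(); the sample values are made up.
        List<String> lines = Arrays.asList("1,India", "2,USA", "9,Japan");
        Map<Integer, String> byId = parse(lines);

        // With Spark, using the variables from the question:
        //   Broadcast<Map<Integer, String>> broadcastVar =
        //       ctx.broadcast(CountryLookup.parse(rdd.collect()));
        //   JavaRDD<String> names = listrdd.map(x -> broadcastVar.value().get(x));
        for (int id : new int[] {1, 2, 9}) {
            System.out.println(id + " -> " + byId.get(id));
        }
    }
}
```

HashMap is Serializable, so the resulting map can be passed to ctx.broadcast directly. Ids missing from the file yield null from get(x), so the mapping function may need to handle that case.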
