How to join two tables — big and small ones — effectively?

I have 2 data sets: one big and one small. I was processing the data in MapReduce by putting the small data set in the distributed cache, reading it in the mapper, and performing the join there along with some more operations.

I want to move this to Spark Java programming. But all I have is a map function where I can transform my RDD, and in place of the distributed cache I can broadcast the RDD — what I am not getting is how to pass the broadcast variable to the map function.

JavaPairRDD<String, String> logData = sc.wholeTextFiles(args[0]);
logData.map(new Transformation());
String[] vals = {"val,hel", "hi,by"};
JavaRDD<String> javaRDD = sc.parallelize(Arrays.asList(vals));
Broadcast<String> broadcastVar = sc.broadcast(javaRDD.toString());

and my map transformation is

public class Transformation implements Function<Tuple2<String, String>, String> {.........}

I want to pass the broadcast variable to the map function and do the join there along with the other transformations.

What you are talking about is called a Map-Side Join. In Spark it can be implemented using a broadcast variable; here's a simple example in PySpark:

cities = {
    1: 'Moscow',
    2: 'London',
    3: 'Paris',
    4: 'Berlin',
    5: 'New York'
}
bcities = sc.broadcast(cities)

data = [
    [1, 1.23],
    [2, 2.34],
    [3, 3.45],
    [4, 4.23],
    [5, 24.24],
    [1, 32.2],
    [2, 22.2],
    [4, 222.3]
]
sc.parallelize(data).map(lambda x: [bcities.value[x[0]], x[1]]).collect()

If the smaller data set is too big to broadcast, it is better to implement a Reduce-Side Join using Spark's join() transformation.
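A minimal Java sketch of that reduce-side join, reusing the toy city data from the PySpark example above (the class name and local-mode setup are just for illustration):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ReduceSideJoin {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("reduce-side-join").setMaster("local[*]"));

            // "big" side: (cityId, amount) pairs
            JavaPairRDD<Integer, Double> amounts = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<Integer, Double>(1, 1.23),
                    new Tuple2<Integer, Double>(2, 2.34),
                    new Tuple2<Integer, Double>(1, 32.2)));

            // "small" side: (cityId, cityName) pairs -- a regular RDD this time, not a broadcast
            JavaPairRDD<Integer, String> cities = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<Integer, String>(1, "Moscow"),
                    new Tuple2<Integer, String>(2, "London")));

            // join() shuffles both RDDs by key and pairs up matching keys:
            // the result is (cityId, (amount, cityName))
            JavaPairRDD<Integer, Tuple2<Double, String>> joined = amounts.join(cities);
            joined.collect().forEach(System.out::println);

            sc.stop();
        }
    }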

For Java, see the example from Learning Spark -- start from line 134, where you can find the line:

final Broadcast<String[]> signPrefixes = sc.broadcast(loadCallSignTable());

Added a constructor for the same and passed the broadcast variable:

    public Transformation(Broadcast<String> val) { }
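A minimal sketch of how that wiring can look end to end. It assumes the small data set is broadcast as a java.util.Map built from the "val,hel" / "hi,by" pairs above (rather than broadcasting javaRDD.toString()), since a Map is what lets each record be looked up inside the function; in a real job the Map would be keyed by whatever field the two data sets share.

    import java.util.Map;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    // Map-side join: the small table travels to every executor as a broadcast Map,
    // and each record of the big RDD is joined against it locally inside call().
    public class Transformation implements Function<Tuple2<String, String>, String> {

        private final Broadcast<Map<String, String>> smallTable;

        public Transformation(Broadcast<Map<String, String>> smallTable) {
            this.smallTable = smallTable;
        }

        @Override
        public String call(Tuple2<String, String> record) {
            // record._1() is the key (the file path for wholeTextFiles), record._2() its contents;
            // look the key up in the broadcast small table and stitch the two sides together
            String match = smallTable.value().get(record._1());
            return record._2() + "," + (match == null ? "" : match);
        }
    }

Wiring it up on the driver side:

    Map<String, String> small = new HashMap<>();
    small.put("val", "hel");
    small.put("hi", "by");
    Broadcast<Map<String, String>> broadcastVar = sc.broadcast(small);

    JavaPairRDD<String, String> logData = sc.wholeTextFiles(args[0]);
    JavaRDD<String> joined = logData.map(new Transformation(broadcastVar));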
