
Spark RDDs join operation with lists

I have the following RDDs:

JavaPairRDD<List<String>, String> firstRDD = ...
firstRDD.foreach(row -> System.out.println(row._1() + ", " + row._2()));
// [Man, Parent], Father

JavaPairRDD<List<String>, String> secondRDD = ...
secondRDD.foreach(row -> System.out.println(row._1() + ", " + row._2()));
// [Man, Parent, Father], Person
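
For reference, a minimal sketch of how such pair RDDs could be constructed to match the printed output above (the JavaSparkContext sc is assumed to exist; java.util.Arrays and scala.Tuple2 imports omitted):

JavaPairRDD<List<String>, String> firstRDD = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>(Arrays.asList("Man", "Parent"), "Father")));
JavaPairRDD<List<String>, String> secondRDD = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>(Arrays.asList("Man", "Parent", "Father"), "Person")));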

I want to perform an inner join, so that a left row matches a right row if the left key is contained in (i.e., is a sublist of) the right key. In the example above, [Man, Parent] is contained in [Man, Parent, Father].

Any suggestions?

Thanks!

For RDDs (and also for JavaPairRDDs) the join operations can only check for exactly matching keys.
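
To illustrate with the sample data (a sketch): join compares keys via equals, and no list key on the left is element-wise equal to a list key on the right, so the result is empty.

// Standard join: keys must be exactly equal, so nothing matches here
JavaPairRDD<List<String>, Tuple2<String, String>> joined = firstRDD.join(secondRDD);
System.out.println(joined.count()); // prints 0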

Therefore we have to transform the RDDs into Dataframes:

public static Dataset<Row> toDataframe(SparkSession spark, JavaPairRDD<List<String>, String> rdd) {
    JavaRDD<Row> rowRDD = rdd.map(tuple -> {
        // Convert the java.util.List key into a Scala Seq so that Spark
        // can store it in an ArrayType column
        Seq<String> key = JavaConverters.asScalaIteratorConverter(tuple._1().iterator()).asScala().toSeq();
        return RowFactory.create(key, tuple._2());
    });
    StructType st = new StructType()
            .add(new StructField("key", DataTypes.createArrayType(DataTypes.StringType), true, new MetadataBuilder().build()))
            .add(new StructField("value", DataTypes.StringType, true, new MetadataBuilder().build()));
    return spark.createDataFrame(rowRDD, st);
}
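
For completeness, these are the imports the snippets in this answer rely on (assuming a Scala 2.12-based Spark build, where Seq is scala.collection.Seq; on Scala 2.13 builds the Seq and converter types differ slightly):

import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.collection.JavaConverters;
import scala.collection.Seq;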

For the join criteria, we need a UDF that checks whether one array is part of the other. If the order of the elements is not important, array_intersect could also be used (see the sketch after the join below).

UserDefinedFunction contains = functions.udf(
        (UDF2<Seq<String>, Seq<String>, Boolean>) (a, b) -> b.containsSlice(a),
        DataTypes.BooleanType);

Putting these two elements together, we get

Dataset<Row> df1 = toDataframe(spark, firstRDD);
Dataset<Row> df2 = toDataframe(spark, secondRDD);
Dataset<Row> result = df1.join(df2, contains.apply(df1.col("key"), df2.col("key")));
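
As mentioned above, if the order of the elements does not matter, the UDF can be replaced by built-in functions (a sketch; array_intersect requires Spark 2.4+, and the size comparison assumes the key arrays contain no duplicate elements):

// left key is a subset of right key  <=>  size(array_intersect(left, right)) == size(left)
Column leftIsSubset = functions.size(functions.array_intersect(df1.col("key"), df2.col("key")))
        .equalTo(functions.size(df1.col("key")));
Dataset<Row> resultNoUdf = df1.join(df2, leftIsSubset);

Since both inputs expose columns named key and value, use df1.col("key") / df2.col("key") (or aliases) to disambiguate when selecting from the joined result.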

With the input data

firstRDD        secondRDD
+------+-----+  +------------+-----+
|   key|value|  |         key|value|
+------+-----+  +------------+-----+
|[a, b]|    A|  |   [a, b, c]|    C|
|[b, a]|    B|  |[a, b, c, d]|    D|
+------+-----+  +------------+-----+

we get

+------+-----+------------+-----+
|   key|value|         key|value|
+------+-----+------------+-----+
|[a, b]|    A|   [a, b, c]|    C|
|[a, b]|    A|[a, b, c, d]|    D|
+------+-----+------------+-----+

Please note that using a UDF as the join criterion might not be the fastest option, since it prevents Spark from using an optimized equi-join.
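
One way to check what Spark actually does is to inspect the physical plan; because the UDF predicate is not an equi-join condition, Spark typically falls back to a nested-loop strategy:

// Prints the physical plan; expect a BroadcastNestedLoopJoin or
// CartesianProduct node instead of a hash or sort-merge join
result.explain();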
