
Spark RDDs join operation with lists

I have the following RDDs:

JavaPairRDD<List<String>, String> firstRDD = ...
firstRDD.foreach(row -> System.out.println(row._1() + ", " + row._2()));
// [Man, Parent], Father

JavaPairRDD<List<String>, String> secondRDD = ...
secondRDD.foreach(row -> System.out.println(row._1() + ", " + row._2()));
// [Man, Parent, Father], Person
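
For reference, a minimal sketch of how such pair RDDs could be constructed to match the printed output above (the JavaSparkContext sc is assumed to exist; java.util.Arrays and scala.Tuple2 imports omitted):

JavaPairRDD<List<String>, String> firstRDD = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>(Arrays.asList("Man", "Parent"), "Father")));
JavaPairRDD<List<String>, String> secondRDD = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>(Arrays.asList("Man", "Parent", "Father"), "Person")));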

I want to perform an inner join, so that a left row matches a right row if the left key is contained in (i.e., is a sublist of) the right key. In the example above, [Man, Parent] is contained in [Man, Parent, Father].

Any suggestions?

Thanks!

For RDDs (and also for JavaPairRDDs) the join operations can only check for exactly matching keys.
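
To illustrate with the sample data (a sketch): join compares keys via equals, and no list key on the left is element-wise equal to a list key on the right, so the result is empty.

// Standard join: keys must be exactly equal, so nothing matches here
JavaPairRDD<List<String>, Tuple2<String, String>> joined = firstRDD.join(secondRDD);
System.out.println(joined.count()); // prints 0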

Therefore we have to transform the RDDs into Dataframes:

public static Dataset<Row> toDataframe(SparkSession spark, JavaPairRDD<List<String>, String> rdd) {
    JavaRDD<Row> rowRDD = rdd.map(tuple -> {
        // Convert the java.util.List key into a Scala Seq so that Spark
        // can store it in an ArrayType column
        Seq<String> key = JavaConverters.asScalaIteratorConverter(tuple._1().iterator()).asScala().toSeq();
        return RowFactory.create(key, tuple._2());
    });
    StructType st = new StructType()
            .add(new StructField("key", DataTypes.createArrayType(DataTypes.StringType), true, new MetadataBuilder().build()))
            .add(new StructField("value", DataTypes.StringType, true, new MetadataBuilder().build()));
    return spark.createDataFrame(rowRDD, st);
}
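
For completeness, these are the imports the snippets in this answer rely on (assuming a Scala 2.12-based Spark build, where Seq is scala.collection.Seq; on Scala 2.13 builds the Seq and converter types differ slightly):

import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.collection.JavaConverters;
import scala.collection.Seq;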

For the join criteria, we need a UDF that checks whether one array is part of the other. If the order of the elements is not important, array_intersect could also be used (see the sketch after the join below).

UserDefinedFunction contains = functions.udf(
        (UDF2<Seq<String>, Seq<String>, Boolean>) (a, b) -> b.containsSlice(a),
        DataTypes.BooleanType);

Putting these two elements together, we get

Dataset<Row> df1 = toDataframe(spark, firstRDD);
Dataset<Row> df2 = toDataframe(spark, secondRDD);
Dataset<Row> result = df1.join(df2, contains.apply(df1.col("key"), df2.col("key")));
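
As mentioned above, if the order of the elements does not matter, the UDF can be replaced by built-in functions (a sketch; array_intersect requires Spark 2.4+, and the size comparison assumes the key arrays contain no duplicate elements):

// left key is a subset of right key  <=>  size(array_intersect(left, right)) == size(left)
Column leftIsSubset = functions.size(functions.array_intersect(df1.col("key"), df2.col("key")))
        .equalTo(functions.size(df1.col("key")));
Dataset<Row> resultNoUdf = df1.join(df2, leftIsSubset);

Since both inputs expose columns named key and value, use df1.col("key") / df2.col("key") (or aliases) to disambiguate when selecting from the joined result.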

With the input data

firstRDD        secondRDD
+------+-----+  +------------+-----+
|   key|value|  |         key|value|
+------+-----+  +------------+-----+
|[a, b]|    A|  |   [a, b, c]|    C|
|[b, a]|    B|  |[a, b, c, d]|    D|
+------+-----+  +------------+-----+

we get

+------+-----+------------+-----+
|   key|value|         key|value|
+------+-----+------------+-----+
|[a, b]|    A|   [a, b, c]|    C|
|[a, b]|    A|[a, b, c, d]|    D|
+------+-----+------------+-----+

Please note that using a UDF as the join criterion might not be the fastest option, since it prevents Spark from using an optimized equi-join.
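
One way to check what Spark actually does is to inspect the physical plan; because the UDF predicate is not an equi-join condition, Spark typically falls back to a nested-loop strategy:

// Prints the physical plan; expect a BroadcastNestedLoopJoin or
// CartesianProduct node instead of a hash or sort-merge join
result.explain();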
