How to join two datasets by key in Scala Spark

I have two datasets, and each dataset has two elements. Examples are below.

Data1: (name, animal)

('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...

Data2: (name, fruit)

('a,efg', 'apple')
('abc,def', 'banana(1)')
...

Expected result: (name, animal, fruit)

('abc,def', 'monkey(1)', 'banana(1)')
... 

I want to join these two datasets on the first column, 'name'. I have been trying for a couple of hours, but I couldn't figure it out. Can anyone help me?

val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))

val joined = text1.join(text2)

The code above does not work!

join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)]. The first step is to transform the input data into the right type.

We first need to transform the original data of type String into pairs of (Key, Value):

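// Parses a line such as ('abc,def', 'monkey(1)') into a (key, value) pair.
val parse: String => (String, String) = s => {
  // Two single-quoted fields in parentheses, separated by a comma and optional non-word characters
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "") // lines that do not match get an empty key and value
  }
}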

(Note that we can't use a simple split(",") expression because the key contains commas.)
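To make the note above concrete, here is a minimal sketch in plain Scala (no Spark needed) that compares split(",") with the regex on one of the sample lines. The object name SplitVsRegex and the value names are made up for this illustration.

```scala
object SplitVsRegex {
  // Same pattern as in the answer above.
  private val pairPattern = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

  // One of the sample lines from the question.
  val line = "('abc,def', 'monkey(1)')"

  // Naive approach: the comma inside the key produces an extra piece.
  val pieces: Array[String] = line.split(",")

  // Regex approach: each quoted field is captured whole.
  val parsed: (String, String) = line match {
    case pairPattern(k, v) => (k, v)
    case _                 => ("", "")
  }

  def main(args: Array[String]): Unit = {
    println(pieces.length) // 3 pieces, not the 2 fields we want
    println(parsed)        // (abc,def,monkey(1))
  }
}
```

The split yields three pieces because the key itself contains a comma, while the regex returns the key and the value as a single pair.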

Then we use that function to parse the text input data:

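// Sample lines in the same format as the input files
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")

// sc is the SparkContext from the question (spark-shell also provides one under this name)
val rdd1 = sc.parallelize(s1)
val rdd2 = sc.parallelize(s2)

// Turn each RDD[String] into an RDD[(String, String)] of (key, value) pairs
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)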
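One caveat with the ("", "") fallback in parse: every line that fails to match ends up under the same empty key, so malformed lines from both inputs would be joined with each other. A possible alternative, sketched here in plain Scala, is a variant that returns an Option and is applied with flatMap so that unparseable lines are dropped. The name parseOpt and the malformed sample line are assumptions for illustration, not part of the original answer.

```scala
object SafeParse {
  // Same pattern as in the answer above.
  private val pairPattern = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

  // Hypothetical variant of parse: None for lines that do not match.
  def parseOpt(s: String): Option[(String, String)] = s match {
    case pairPattern(k, v) => Some((k, v))
    case _                 => None
  }

  // Sample input; the middle line is deliberately malformed (made up for this sketch).
  val lines = Seq("('abc,def', 'monkey(1)')", "not a pair", "('df,gh', 'zebra')")

  // flatMap keeps only the lines that parsed successfully.
  val pairs: Seq[(String, String)] = lines.flatMap(l => parseOpt(l))

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```

The same idea should carry over to an RDD, for example rdd1.flatMap(l => parseOpt(l)), although that call is not exercised in this sketch.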

Finally, we use the join method to join the two RDDs:

val joined = kvRdd1.join(kvRdd2)

// Let's check the results

joined.collect

// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
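The join returns nested pairs of the form (name, (animal, fruit)), whereas the question asks for flat triples (name, animal, fruit). A minimal sketch of the extra mapping step follows, written in plain Scala against the collected result shown above. The object name FlattenJoined and the helper flatten are hypothetical names introduced for this example.

```scala
object FlattenJoined {
  // join produces (key, (leftValue, rightValue)); unpack it into a flat triple.
  def flatten(row: (String, (String, String))): (String, String, String) = row match {
    case (name, (animal, fruit)) => (name, animal, fruit)
  }

  // The collected join result from above, written out as a plain array.
  val collected = Array(("abc,def", ("monkey(1)", "banana(1)")))

  val flat: Array[(String, String, String)] = collected.map(flatten)

  def main(args: Array[String]): Unit = flat.foreach(println)
}
```

On the RDD itself the same pattern match can be applied before collecting, for example joined.map { case (name, (animal, fruit)) => (name, animal, fruit) }.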

You have to create pair RDDs for your datasets first, and then apply the join transformation. Your datasets do not look quite right.

Please consider the example below.

**Dataset1**

a 1
b 2
c 3

**Dataset2**

a 8
b 4

Your code should look like this in Scala:

    val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))

    val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))

    val joinRDD = pairRDD1.join(pairRDD2)

    joinRDD.collect

Here is the result from the Scala shell:

res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
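Note that key c from Dataset1 is missing from the result: join is an inner join, so only keys present in both RDDs are kept. If unmatched keys from the first dataset should be retained, leftOuterJoin is one option. The sketch below assumes a Spark dependency on the classpath and a local master. It inlines the sample data instead of reading files, and the names OuterJoinSketch, toPair and leftJoined are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OuterJoinSketch {
  // Splits a "key value" line on a single space, as in the answer above.
  def toPair(line: String): (String, String) = {
    val fields = line.split(" ")
    (fields(0), fields(1))
  }

  // Runs the left outer join on the sample data and returns the rows sorted by key.
  def leftJoined(): List[(String, (String, Option[String]))] = {
    val conf = new SparkConf().setAppName("outer-join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Dataset1 and Dataset2 from above, inlined rather than loaded with textFile.
      val pairRDD1 = sc.parallelize(Seq("a 1", "b 2", "c 3")).map(toPair)
      val pairRDD2 = sc.parallelize(Seq("a 8", "b 4")).map(toPair)

      // Every key from the left side is kept; a missing right value becomes None.
      pairRDD1.leftOuterJoin(pairRDD2).collect().sortBy(_._1).toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = leftJoined().foreach(println)
}
```

With this variant the right-hand value is wrapped in an Option, so a and b carry Some(...) while c is kept with None.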
