How to join two datasets by key in Scala Spark
I have two datasets, and each dataset has two elements. Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Expected results: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets by the first column, 'name'. I have tried to do this for a couple of hours but couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)]. The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
(Note that we can't use a simple split(",") expression, because the key contains commas.)
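A quick sanity check of the parse function, runnable in the plain Scala REPL without Spark, confirms that the regex keeps the comma inside the key and falls back to empty strings on unmatched input:

```scala
// Same parser as above: the two capture groups grab the quoted key and value,
// so the comma inside the key is preserved; anything else yields ("", "").
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
println(parse("('abc,def', 'monkey(1)')")) // (abc,def,monkey(1))
println(parse("not a pair"))               // (,)
```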
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sc.parallelize(s1)
val rdd2 = sc.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs:
val joined = kvRdd1.join(kvRdd2)
// Let's check the results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
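Since the question's expected output is a flat triple (name, animal, fruit) rather than the nested pair above, one more map flattens it. The pattern match is demonstrated here on a plain Scala collection so it runs without a SparkContext; the identical .map works on the joined RDD:

```scala
// The join result has shape (key, (value1, value2)).
// Flatten each element into a (name, animal, fruit) triple.
// Demonstrated on a Seq; the same map applies unchanged to the joined RDD.
val joined = Seq(("abc,def", ("monkey(1)", "banana(1)")))
val triples = joined.map { case (name, (animal, fruit)) => (name, animal, fruit) }
println(triples) // List((abc,def,monkey(1),banana(1)))
```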
You have to create pair RDDs for your datasets first, then apply the join transformation. Your datasets do not look accurate.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
In Scala, your code should look like the following:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
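Note that join is an inner join, so key c from Dataset1 is dropped because it has no match in Dataset2. If unmatched keys must be kept, Spark's leftOuterJoin wraps the right-hand value in an Option instead. A minimal sketch of its semantics on plain Scala Maps (no cluster required):

```scala
// Plain-Scala model of pairRDD1.leftOuterJoin(pairRDD2):
// every left key survives, and the right value becomes an Option.
val left  = Map("a" -> "1", "b" -> "2", "c" -> "3")
val right = Map("a" -> "8", "b" -> "4")
val leftJoined = left.map { case (k, v) => (k, (v, right.get(k))) }
println(leftJoined("c")) // (3,None)
```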