
Join two datasets by using the first column in Scala Spark

I have two datasets: (film name, actress's name) and (film name, director's name).

I want to join them on the film name, giving (film name, actress's name, director's name).

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.io.Source

object spark {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FindFrequentPairs").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text1: RDD[String] = sc.textFile(args(0))
    val text2: RDD[String] = sc.textFile(args(1))

    val joined = text1.join(text2) // does not compile: join is not available on RDD[String]
  }
}

I tried to use 'join', but I get 'cannot resolve symbol join'. Do you have any idea how to join them?

This is part of one of my datasets (film name, actress):

('"Please Like Me" (2013) {Rhubarb and Custard (#1.1)}', '$haniqua')
('"Please Like Me" (2013) {Spanish Eggs (#1.5)}', '$haniqua')
('A Woman of Distinction (1950)  (uncredited)', '& Ashour, Lucienne')
('Around the World (1943)  (uncredited)', '& Ashour, Lucienne')
('Chain Lightning (1950)  (uncredited)', '& Ashour, Lucienne')

join is only defined on RDDs of key/value pairs, so you first have to turn each dataset into a pair RDD and then apply the join transformation. Your sample data also does not look cleanly formatted.
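As a rough sketch of the "create a pair" step for lines in the format shown in the question, the plain-Scala helper below extracts (film, person) from one line. The object name `ParseTupleLine` and the function `parseLine` are hypothetical, and the regex assumes every field is wrapped in single quotes and that no field contains the sequence `', '`; real data may need a more careful parser.

```scala
object ParseTupleLine {
  // Matches lines such as ('Chain Lightning (1950)  (uncredited)', '& Ashour, Lucienne').
  // Assumption: both fields are single-quoted and neither contains the sequence "', '".
  private val TupleLine = """^\('(.*)', '(.*)'\)$""".r

  /** Returns Some((film, person)) for a line in the assumed format, None otherwise. */
  def parseLine(line: String): Option[(String, String)] =
    line.trim match {
      case TupleLine(film, person) => Some((film, person))
      case _                       => None
    }

  def main(args: Array[String]): Unit = {
    val sample = """('"Please Like Me" (2013) {Rhubarb and Custard (#1.1)}', '$haniqua')"""
    println(parseLine(sample))             // a parsed (film, person) pair
    println(parseLine("not a tuple line")) // no match
  }
}
```

With Spark, such a helper could be applied as `sc.textFile(path).flatMap(line => ParseTupleLine.parseLine(line))`, which drops lines that do not match and yields an RDD of pairs.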

Consider the example below.

**Dataset1**

a 1
b 2
c 3

**Dataset2**

a 8
b 4

Your code in Scala should look like this:

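// Split each line once on a space, then build a (key, value) pair from the first two fields.
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(_.split(" ")).map(fields => (fields(0), fields(1)))

val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(_.split(" ")).map(fields => (fields(0), fields(1)))

// Inner join on the key: only keys present in both RDDs are kept.
val joinRDD = pairRDD1.join(pairRDD2)

joinRDD.collect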

Here is the result from the Scala shell:

res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
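The joined values come back nested as (key, (value1, value2)), and the key c is missing because join is an inner join. A minimal sketch of flattening the result into the (film name, actress, director) shape asked for in the question, and of keeping unmatched films with leftOuterJoin, could look like this. The object name `JoinFilmsSketch`, the helper `joinFlat`, and the sample records are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object JoinFilmsSketch {
  /** Inner join on the film name, flattened to (film, actress, director). */
  def joinFlat(actresses: RDD[(String, String)],
               directors: RDD[(String, String)]): RDD[(String, String, String)] =
    actresses.join(directors).map { case (film, (actress, director)) =>
      (film, actress, director)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinFilmsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample records, for illustration only.
      val actresses = sc.parallelize(Seq(("Film A", "Actress 1"), ("Film B", "Actress 2")))
      val directors = sc.parallelize(Seq(("Film A", "Director X")))

      // Only films present in both datasets survive the inner join.
      joinFlat(actresses, directors).collect().foreach(println)

      // leftOuterJoin keeps every film from the left side; the director is an Option.
      actresses.leftOuterJoin(directors).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Whether an inner or a left outer join is appropriate depends on whether films without a matching director should appear in the output.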
