
Combine two RDDs in Scala

The first RDD, user_person, is a Hive table that records each person's information:

+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
|     -100|  1|null|
|        3|  4|null|
...

Below is my second RDD, a Hive table that has only 40 rows and contains only basic information:

+---+--------+------+------+
| id|startage|endage|energy|
+---+--------+------+------+
|  1|       0|   0.2|     1|
|  1|       2|    10|     3|
|  1|      10|    20|     5|
+---+--------+------+------+

I want to compute every person's energy requirement by looking up the age range that each row falls into.

For example, a person whose age is 4 falls in the range [2, 10), so they require 3 energy. I want to add that information to the user_person RDD.
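To make the intended lookup concrete, here is a minimal pure-Scala sketch of the range lookup (the range values mirror the sample rows above; the names EnergyRange and energyFor are illustrative, not from the original tables):

```scala
// Illustrative model of one row of the second table.
case class EnergyRange(startAge: Double, endAge: Double, energy: Int)

// The three sample rows from the question.
val ranges = Seq(
  EnergyRange(0, 0.2, 1),
  EnergyRange(2, 10, 3),
  EnergyRange(10, 20, 5)
)

// Find the first range whose half-open interval [startAge, endAge)
// contains the given age, and return its energy value if any.
def energyFor(age: Double): Option[Int] =
  ranges.find(r => age >= r.startAge && age < r.endAge).map(_.energy)
```

For example, `energyFor(4)` returns `Some(3)`, matching the example above; an age outside every range (such as 25) returns `None`.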

How can I do this?

First, initialize the Spark session with enableHiveSupport() and copy the Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) into the Spark/conf/ directory so that Spark can read from Hive.

val sparkSession = SparkSession.builder()
  .appName("spark-scala-read-and-write-from-hive")
  .config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

Read the Hive tables as dataframes as below:

val personDF = sparkSession.sql("SELECT * FROM user_person")
val infoDF = sparkSession.sql("SELECT * FROM person_info")

Join the two dataframes on the age range using the expression below:

import sparkSession.implicits._

val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")

The outputDF dataframe contains all the columns from both input dataframes: each person row is matched with the info row whose [startage, endage) range contains that person's age.
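If you only need the enriched person rows, a sketch of a follow-up step could project the relevant columns and save the result back to Hive (the column names come from the sample tables above; the target table name user_person_with_energy is hypothetical):

```scala
// Keep only the columns of interest after the range join.
val resultDF = outputDF.select("person_id", "age", "energy")

// Optionally write the enriched table back to Hive.
// "user_person_with_energy" is an assumed target table name.
resultDF.write.mode("overwrite").saveAsTable("user_person_with_energy")
```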
