Merging two RDDs in Scala

Question

The first RDD, user_person, is a Hive table that records every person's information:

+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
|     -100|  1|null|
|        3|  4|null|
...

Below is my second RDD, a Hive table that has only 40 rows and contains only basic information:

+---+--------+------+------+
| id|startage|endage|energy|
+---+--------+------+------+
|  1|       0|   0.2|     1|
|  1|       2|    10|     3|
|  1|      10|    20|     5|

For each row of user_person, I want to compute the person's energy requirement from the age range that the person's age falls into.

For example, a person whose age is 4 requires 3 energy. I want to add that information to the user_person RDD.

How can I do this?

Answer 1

First, initialize the Spark session with enableHiveSupport() and copy the Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) to Spark's conf/ directory so that Spark can read from Hive.

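import org.apache.spark.sql.SparkSession

// `params` is not defined in this snippet; it is assumed to be a configuration object holding the Hive host URL.
// Since Spark 2.0, spark.sql.warehouse.dir is the preferred key for the warehouse location.
val spark = SparkSession.builder()
  .appName("spark-scala-read-and-write-from-hive")
  .config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()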

Read the Hive tables as DataFrames:

val personDF = spark.sql("SELECT * FROM user_person")
val infoDF = spark.sql("SELECT * FROM person_info")

Join the two DataFrames with the following expression:

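import spark.implicits._ // enables the $"column" syntax

// Range (non-equi) join: a person matches the row whose [startage, endage) interval contains the age
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")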

The outputDF DataFrame contains all the columns of both input DataFrames.
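Note that join defaults to an inner join, so any person whose age falls outside every range is dropped from the result. If the goal is to keep every row of user_person and simply attach an energy column, a left outer join may be a better fit. The following is a minimal sketch of that variant, not the answerer's code: the object name EnergyJoinSketch, the helper withEnergy, the local master, and the in-memory stand-ins for the Hive tables are all assumptions made for illustration. It also wraps the small table in broadcast(), a hint that is often useful when one side of a range join has only a few rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper object, introduced only for this illustration.
object EnergyJoinSketch {

  // Left outer join: every person is kept; energy is null when no age range matches.
  def withEnergy(personDF: DataFrame, infoDF: DataFrame): DataFrame =
    personDF
      .join(
        broadcast(infoDF), // hint: the lookup table is small enough to copy to every executor
        personDF("age") >= infoDF("startage") && personDF("age") < infoDF("endage"),
        "left_outer")
      .select(personDF("person_id"), personDF("age"), personDF("bmi"), infoDF("energy"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("energy-join-sketch")
      .master("local[*]") // assumption: local run, no Hive needed for the demo
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-ins for the two Hive tables, using the rows shown in the question.
    val personDF = Seq((-100, 1, Option.empty[Double]), (3, 4, Option.empty[Double]))
      .toDF("person_id", "age", "bmi")
    val infoDF = Seq((1, 0.0, 0.2, 1), (1, 2.0, 10.0, 3), (1, 10.0, 20.0, 5))
      .toDF("id", "startage", "endage", "energy")

    withEnergy(personDF, infoDF).show()
    spark.stop()
  }
}
```

With these sample rows, the person aged 4 gets energy 3, while the person aged 1 falls into the gap between 0.2 and 2 and therefore keeps a null energy instead of disappearing from the output. Against the real tables, withEnergy would be called with the personDF and infoDF read from Hive.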
