Joining DataFrames in Spark with Java

Question

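First of all, thank you for taking the time to read my question.

My question is the following: in Spark with Java, I have loaded the data from two CSV files into two dataframes.

These dataframes contain the following information.

Dataframe airport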

Id | Name    | City
-----------------------
1  | Barajas | Madrid

Dataframe airport_city_state

City | state
----------------
Madrid | España

I want to join these two dataframes so that the result looks like this:

Dataframe result

Id | Name    | City   | state
--------------------------
1  | Barajas | Madrid | España

where dfairport.city = dfairport_city_state.city

However, I cannot work out the syntax to perform the join correctly. Here is some of the code showing how I create the variables:

// Load the CSVs; the loader must be told that there is a header row and which delimiter is used
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Rename the columns of the CSV dataframes to match the columns in the database.
// Once the names match, the rows can be inserted.
// Note: withColumnRenamed returns a new Dataset; the results are not assigned here,
// so dfairport and dfairport_city_state keep their original column names.
dfairport
    .withColumnRenamed("leg_key", "id")
    .withColumnRenamed("leg_name", "name")
    .withColumnRenamed("leg_city", "city");

dfairport_city_state
    .withColumnRenamed("city", "ciudad")
    .withColumnRenamed("state", "estado");

Answer 1

You can use the join method with a column name to join the two dataframes, for example:

Dataset <Row> dfairport = Load.Csv (sqlContext, data_airport);
Dataset <Row> dfairport_city_state = Load.Csv (sqlContext,   data_airport_city_state);

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"));

There is also an overloaded version that lets you specify the join type as a third argument, for example:

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"), "left_outer");

More about joins here.
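Note that `dfairport_city_state("City")` is Scala's apply syntax and does not compile in Java. A minimal sketch of the Java equivalents follows, assuming both dataframes have a column named exactly `City` as in the question's example tables. The class and method names are hypothetical and only there to make the snippet a complete compilation unit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class JoinVariants {

    // Inner join on a column name shared by both sides.
    // The result contains a single City column.
    static Dataset<Row> joinByName(Dataset<Row> airports, Dataset<Row> cityStates) {
        return airports.join(cityStates, "City");
    }

    // Left outer join with an explicit join expression.
    // Both City columns are kept in the result.
    static Dataset<Row> leftOuterJoin(Dataset<Row> airports, Dataset<Row> cityStates) {
        return airports.join(
                cityStates,
                airports.col("City").equalTo(cityStates.col("City")),
                "left_outer");
    }
}
```

In Java, a column is referenced with `col("...")` on the Dataset, and the single-column-name overload of `join` takes a plain `String`.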

Answer 2

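First of all, thank you very much for your reply.

I have tried both of your solutions, but neither works. I get the following error: The method dfairport_city_state(String) is undefined for the type ETL_Airport

I cannot access a specific column of the dataframe to do the join.

Edit: I have got the join working. I am posting the solution here in case it helps someone else ;)

Thanks for everything, and best regards.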

// Join the tables on the city they share
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));
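For reference, here is a self-contained sketch of the same join using small in-memory dataframes instead of the CSV files. The class name, the local master setting, and the sample rows are assumptions for illustration; the column names leg_key, leg_name, leg_city, city and state come from the question. Joining on an expression keeps both city columns, so the sketch drops the right-hand one and renames the rest to match the desired result.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class AirportJoinSketch {
    public static void main(String[] args) {
        // Local session, for illustration only
        SparkSession spark = SparkSession.builder()
                .appName("AirportJoinSketch")
                .master("local[*]")
                .getOrCreate();

        // Schemas mirroring the column names used in the question
        StructType airportSchema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("leg_key", DataTypes.IntegerType, false),
                DataTypes.createStructField("leg_name", DataTypes.StringType, false),
                DataTypes.createStructField("leg_city", DataTypes.StringType, false)
        });
        StructType cityStateSchema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("city", DataTypes.StringType, false),
                DataTypes.createStructField("state", DataTypes.StringType, false)
        });

        // Sample rows standing in for the CSV contents
        Dataset<Row> dfairport = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, "Barajas", "Madrid")), airportSchema);
        Dataset<Row> dfairport_city_state = spark.createDataFrame(
                Arrays.asList(RowFactory.create("Madrid", "España")), cityStateSchema);

        // Inner join on the city, then drop the duplicate right-hand column
        // and rename the remaining columns to the desired output names
        Dataset<Row> joined = dfairport
                .join(dfairport_city_state,
                        dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")))
                .drop(dfairport_city_state.col("city"))
                .withColumnRenamed("leg_key", "Id")
                .withColumnRenamed("leg_name", "Name")
                .withColumnRenamed("leg_city", "City");

        joined.show();
        spark.stop();
    }
}
```

The results of `withColumnRenamed` are chained here rather than discarded, because each call returns a new Dataset instead of modifying the original.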

Answer 1: score 9, 2017-03-26 20:07:13

Answer 2: score 5, accepted, 2017-03-27 10:26:41
