Apache Spark: joining Datasets from tables that share column names (with different data)

Question

I want to join multiple Datasets that have some columns with the same name but different data. Columns can be renamed when converting to a DataFrame, but is it possible to rename columns, or add a prefix to the column names, while working with Datasets?

Dataset<Row> uct = spark.read().jdbc(jdbcUrl, "uct", connectionProperties);
Dataset<Row> si = spark.read().jdbc(jdbcUrl, "si", connectionProperties).filter("status = 'ACTIVE'");
Dataset<Row> uc = uct.join(si, uct.col("service_id").equalTo(si.col("id")));

uc will have two columns with the same name, code, which makes it difficult to get the value of code from either uct.code or si.code.

Answer 1

In Spark's Scala API, DataFrame is a type alias for Dataset[Row] (Dataset<Row> in Java), so in practice your code is already using DataFrames. If you want to keep both columns that share a name, you have to rename one of them with withColumnRenamed before performing the join.
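As a rough sketch of that renaming step, the example below prefixes every column on each side before the join, so both code columns survive under distinct names. The class name PrefixedJoinSketch, the helper withPrefix, the prefixes uct_ and si_, the local[1] master, and the small in-memory tables standing in for the JDBC reads are all assumptions made for illustration; only withColumnRenamed, columns, col, and join come from Spark itself.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class PrefixedJoinSketch {

    // Hypothetical helper: returns ds with every column renamed to prefix + name.
    static Dataset<Row> withPrefix(Dataset<Row> ds, String prefix) {
        Dataset<Row> renamed = ds;
        for (String name : ds.columns()) {
            renamed = renamed.withColumnRenamed(name, prefix + name);
        }
        return renamed;
    }

    // Builds in-memory stand-ins for the uct and si tables (assumed sample data)
    // and joins them after prefixing, so no column name is shared.
    static Dataset<Row> joinSample(SparkSession spark) {
        StructType uctSchema = new StructType()
                .add("service_id", DataTypes.IntegerType)
                .add("code", DataTypes.StringType);
        StructType siSchema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("code", DataTypes.StringType);

        List<Row> uctRows = Arrays.asList(
                RowFactory.create(1, "U-1"),
                RowFactory.create(2, "U-2"));
        List<Row> siRows = Arrays.asList(
                RowFactory.create(1, "S-1"));

        Dataset<Row> uct = withPrefix(spark.createDataFrame(uctRows, uctSchema), "uct_");
        Dataset<Row> si = withPrefix(spark.createDataFrame(siRows, siSchema), "si_");

        // The join condition uses the prefixed names.
        return uct.join(si, uct.col("uct_service_id").equalTo(si.col("si_id")));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("PrefixedJoinSketch")
                .getOrCreate();

        Dataset<Row> uc = joinSample(spark);
        // Both code columns are now addressable without ambiguity.
        uc.select("uct_code", "si_code").show();

        spark.stop();
    }
}
```

If only one column clashes, a single call such as si.withColumnRenamed("code", "si_code") before the join is enough; prefixing every column is simply a way to avoid listing the clashes by hand.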

1, accepted, 2018-01-19 11:24:15
