Merging two tables in Scala / Spark

Question

I have two tab-delimited data files that look like this:

File 1:

number  type    data_present
 1       a        yes
 2       b        no

File 2:

type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

I want to merge these two files so that the output file looks like this:

number  type    data_present    group   recorded
  1      a         yes           NULL    NULL
  2      b         no            NULL    NULL
  10     d         NULL           aa     true
  20     c         NULL           cc     false

As you can see, for columns that do not exist in the other file, I fill those positions with NULL.

Any ideas on how to do this in Scala / Spark?

Answer 1

Create two files for your datasets:

$ cat file1.csv 
number  type    data_present
 1       a        yes
 2       b        no

$ cat file2.csv
type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

Convert them to CSV:

$ sed -e 's/^[ \t]*//' file1.csv | tr -s ' ' | tr ' ' ',' > f1.csv
$ sed -e 's/^[ \t]*//' file2.csv | tr -s ' ' | tr ' ' ',' > f2.csv

Load the CSV files as DataFrames using the spark-csv module:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f1.csv", "header" -> "true"))
val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f2.csv", "header" -> "true"))
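The `sqlContext.load` call and the external spark-csv package reflect Spark 1.x. On Spark 2.0 or later a CSV reader is built in and can take a tab separator directly, which may make the sed/tr conversion unnecessary. A minimal sketch, assuming Spark 2.x+ and input that really is tab-separated; the helper names `writeTemp` and `readTsv`, the app name, and the temporary files are made up for illustration:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: Spark 2.0+; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder()
  .appName("merge-tables")   // hypothetical app name
  .master("local[*]")        // local mode, for experimentation only
  .getOrCreate()

// Hypothetical helper: write sample rows to a temporary tab-delimited file.
def writeTemp(content: String): String = {
  val p = Files.createTempFile("sample", ".tsv")
  Files.write(p, content.getBytes(StandardCharsets.UTF_8))
  p.toString
}

val path1 = writeTemp("number\ttype\tdata_present\n1\ta\tyes\n2\tb\tno\n")
val path2 = writeTemp("type\tgroup\tnumber\trecorded\nd\taa\t10\ttrue\nc\tcc\t20\tfalse\n")

// Hypothetical helper: read a tab-delimited file with a header row,
// asking the parser to trim whitespace around values.
def readTsv(path: String): DataFrame =
  spark.read
    .option("header", "true")
    .option("sep", "\t")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .csv(path)

val df1 = readTsv(path1)
val df2 = readTsv(path2)
df1.printSchema()
```

Without `inferSchema`, every column is read as a string, which is usually fine for a merge like this one.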

Now perform the join:

scala> df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer").show()

+------+----+------------+----+-----+------+--------+
|number|type|data_present|type|group|number|recorded|
+------+----+------------+----+-----+------+--------+
|     1|   a|         yes|null| null|  null|    null|
|     2|   b|          no|null| null|  null|    null|
|  null|null|        null|   d|   aa|    10|    true|
|  null|null|        null|   c|   cc|    20|   false|
+------+----+------------+----+-----+------+--------+

For more details, see here, here, and here.
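Note that the join output above still carries `number` and `type` twice, once per side, so it is not yet the single-key layout the question asks for. One way to collapse them is `coalesce`. A rough sketch, assuming Spark 2.x or later and using small in-memory DataFrames as stand-ins for the two files (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

// Assumption: Spark 2.x+, local mode; in spark-shell `spark` already exists.
val spark = SparkSession.builder().appName("merge-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory stand-ins for the two files from the question (all values as strings).
val df1 = Seq(("1", "a", "yes"), ("2", "b", "no")).toDF("number", "type", "data_present")
val df2 = Seq(("d", "aa", "10", "true"), ("c", "cc", "20", "false"))
  .toDF("type", "group", "number", "recorded")

// Same null-safe full outer join as in the answer.
val joined = df1.join(
  df2,
  df1("number") <=> df2("number") && df1("type") <=> df2("type"),
  "outer")

// Take each key from whichever side is non-null, then keep the remaining columns.
val merged = joined.select(
  coalesce(df1("number"), df2("number")).as("number"),
  coalesce(df1("type"), df2("type")).as("type"),
  df1("data_present"),
  df2("group"),
  df2("recorded"))

merged.show()
```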

Answer 2

This will give you the desired output:

val output = file1.join(file2, Seq("number","type"), "outer")
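The one-liner assumes `file1` and `file2` are already DataFrames. A self-contained sketch of the same call, assuming Spark 2.x or later and in-memory sample data in place of the real files (the app name is a placeholder). As far as I know, a join on a `Seq` of column names emits each key column once, and for an outer join fills it from whichever side has a value:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+, local mode; in spark-shell `spark` already exists.
val spark = SparkSession.builder().appName("using-join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory stand-ins for the two files from the question.
val file1 = Seq(("1", "a", "yes"), ("2", "b", "no")).toDF("number", "type", "data_present")
val file2 = Seq(("d", "aa", "10", "true"), ("c", "cc", "20", "false"))
  .toDF("type", "group", "number", "recorded")

// Full outer join on the shared columns; rows without a match get nulls
// in the columns that come from the other side.
val output = file1.join(file2, Seq("number", "type"), "outer")

output.show()
```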

Answer 3

Simply cast all columns to String, then do a union of the two DataFrames.
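The answer gives no code, so here is one possible reading of it, not necessarily what the author had in mind: cast every column to string, add the columns each side lacks as null strings, and union. The sketch assumes Spark 2.x or later; `align`, `allCols`, and the sample data are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: Spark 2.x+, local mode; in spark-shell `spark` already exists.
val spark = SparkSession.builder().appName("union-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data with mixed types, to show why the cast to string is useful.
val file1 = Seq((1, "a", "yes"), (2, "b", "no")).toDF("number", "type", "data_present")
val file2 = Seq(("d", "aa", 10, true), ("c", "cc", 20, false))
  .toDF("type", "group", "number", "recorded")

// Target column order, taken from the desired output in the question.
val allCols = Seq("number", "type", "data_present", "group", "recorded")

// Hypothetical helper: cast existing columns to string and
// fill the missing ones with null strings, in a fixed order.
def align(df: DataFrame): DataFrame =
  df.select(allCols.map { c =>
    if (df.columns.contains(c)) col(c).cast("string").as(c)
    else lit(null).cast("string").as(c)
  }: _*)

// union matches columns by position, hence the shared ordering above.
val merged = align(file1).union(align(file2))

merged.show()
```

Unlike the join-based answers, this never tries to match rows across the two files; it only stacks them, which is all the sample data needs.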


Answer 1
2 2015-08-04 13:50:49

Answer 2
2 2016-06-29 13:28:14

Answer 3
0 2017-06-14 14:08:54
