Incrementally loading data into a Parquet file using Apache Spark and Java

Question

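I have the dataset shown below saved in Parquet format, and I want to load new data and update this file. For example, if a new ID "3" arrives, I can add that new ID using UNION. But if an ID that already exists arrives again with a later timestamp in the last_updated column, I want to keep only the latest record. How can I achieve this using Apache Spark and Java?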

+-------+------------+--------------------+---------+
|     id|display_name|        last_updated|is_active|
+-------+------------+--------------------+---------+
|      1|        John|2018-07-23 08:32:...|     true|
|      2|        Tony|2018-07-22 20:32:...|     true|
+-------+------------+--------------------+---------+

Answer 1

You can use "group by" on the last_updated column to get the latest row for each id. For example, after the union you will have a dataset such as:

+-------+------------+--------------------+---------+
|     id|display_name|        last_updated|is_active|
+-------+------------+--------------------+---------+
|      1|        John|2018-07-23 08:32:...|     true|
|      2|        Tony|2018-07-22 20:32:...|     true|
|      2|        Tony|2018-07-22 21:45:...|     true|
+-------+------------+--------------------+---------+

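First, you have to load this dataset into a DataFrame and register it as a temporary view (called your_temp_view here). Then you can write the following SQL: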

select
  t1.id, t1.display_name, t1.last_updated, t1.is_active
from
  your_temp_view as t1
  inner join (
    -- latest timestamp per id
    select
      id, max(last_updated) as max_last_updated
    from
      your_temp_view
    group by id
  ) as t2 on t1.id = t2.id and t1.last_updated = t2.max_last_updated
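
To make the rule that this query applies easier to see, here is a minimal plain-Java sketch of the same "union, then keep the newest row per id" logic on in-memory records. It does not use Spark. The class UserRow, the method mergeKeepLatest, the display name "NewUser" for id 3, and all timestamp values are assumptions made up for illustration (the timestamps in the tables above are truncated). One difference from the SQL: if two rows for the same id share exactly the same last_updated value, the join returns both, while this sketch keeps only the first one seen.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LatestById {

    // Hypothetical row type mirroring the four columns in the question.
    static final class UserRow {
        final int id;
        final String displayName;
        final LocalDateTime lastUpdated;
        final boolean isActive;

        UserRow(int id, String displayName, LocalDateTime lastUpdated, boolean isActive) {
            this.id = id;
            this.displayName = displayName;
            this.lastUpdated = lastUpdated;
            this.isActive = isActive;
        }
    }

    // Union the two lists, then keep one row per id: the one with the greatest lastUpdated.
    static List<UserRow> mergeKeepLatest(List<UserRow> existing, List<UserRow> incoming) {
        List<UserRow> all = new ArrayList<>(existing);
        all.addAll(incoming);

        // LinkedHashMap keeps ids in the order they were first seen.
        Map<Integer, UserRow> latest = new LinkedHashMap<>();
        for (UserRow row : all) {
            // Replace the stored row only when the candidate is strictly newer.
            latest.merge(row.id, row, (kept, candidate) ->
                    candidate.lastUpdated.isAfter(kept.lastUpdated) ? candidate : kept);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        // Illustrative values only; seconds are placeholders.
        List<UserRow> existing = Arrays.asList(
                new UserRow(1, "John", LocalDateTime.of(2018, 7, 23, 8, 32), true),
                new UserRow(2, "Tony", LocalDateTime.of(2018, 7, 22, 20, 32), true));
        List<UserRow> incoming = Arrays.asList(
                new UserRow(2, "Tony", LocalDateTime.of(2018, 7, 22, 21, 45), true),
                new UserRow(3, "NewUser", LocalDateTime.of(2018, 7, 23, 9, 0), true));

        for (UserRow r : mergeKeepLatest(existing, incoming)) {
            System.out.println(r.id + " " + r.displayName + " " + r.lastUpdated + " " + r.isActive);
        }
    }
}
```

In Spark the same outcome comes from running the query above against the unioned DataFrame. When writing the result back, it is safer to write to a new path than to overwrite the Parquet path that is still being read in the same job.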

Answer 1 (accepted; 2018-07-23 05:50:46)
