
Work with data differencing (Deltas) in Spark using Dataframes

I have one parquet file in HDFS as the initial load of my data. All the following parquet files contain only the datasets that changed relative to the initial load, one per day (in chronological order). These are my deltas. I want to read all or just a few of the parquet files to get the latest state of the data as of a specific date. Deltas can contain new records, too.

Example:

Initial Data (Folder: /path/spezific_data/20180101):

ID | Name    | Street     |
 1 | "Tom"   | "Street 1" |
 2 | "Peter" | "Street 2" |

Delta 1 (Folder: /path/spezific_data/20180102):

ID | Name  | Street      |
 1 | "Tom" | "Street 21" |

Delta 2 (Folder: /path/spezific_data/20180103):

ID | Name    | Street      |
 2 | "Peter" | "Street 44" |
 3 | "Hans"  | "Street 12" |

Delta 3 (Folder: /path/spezific_data/20180105):

ID | Name   | Street      |
 2 | "Hans" | "Street 55" |

It is possible that a day has deltas but they are only loaded a day later (look at Delta 2 and Delta 3). So the folder /path/spezific_data/20180104 does not exist, and we never want to load this date. Now I want to load different cases:

  1. Only initial data: that is an easy load of a directory.

 initial = spark.read.parquet("hdfs:/path/spezific_data/20180101/")

  2. Until a specific date (20180103):

 initial_df = spark.read.parquet("hdfs:/path/spezific_data/20180101/")
 delta_df = spark.read.parquet("hdfs:/path/spezific_data/20180102/")

Now I have to merge these datasets ("update": I know Spark RDDs and DataFrames cannot do an in-place update), then load the next delta and merge it as well. Currently I solve this with these lines of code (but inside a for loop):

 new_df = delta_df.union(initial_df).dropDuplicates("ID")
 delta_df = spark.read.parquet("hdfs:/mypath/20180103/")
 new_df = delta_df.union(new_df).dropDuplicates("ID")

But I think that is not a good way to do this.

  3. Load all data in the folder "/path/spezific_data": I do this like step one, with a for loop up to the latest date.

Questions: Can I do it like this? Are there better ways? Can I load everything into one DataFrame and merge it there?
Currently the load takes very long (one hour).

Update 1:
I tried to do something like the following. If I run this code, it goes through all dates up to my end date (I can see that from my println(date)). After that, I get a java.lang.StackOverflowError. Where is the error?

import org.apache.spark.sql.functions.col
import util.control.Breaks._

var sourcePath = "hdfs:sourcepath/"
var destinationPath = "hdfs:destinationpath/result"
var initial_date = "20170427"
var start_year = 2017
var end_year = 2019
var end_month = 10
var end_day = 31

var m : String = _
var d : String = _
var date : String = _
var delta_df : org.apache.spark.sql.DataFrame = _
var doubleRows_df : org.apache.spark.sql.DataFrame = _

//final DF, initial load
var final_df = spark.read.parquet(sourcePath + initial_date +  "*")

breakable{
   for(year <- start_year to end_year; month <- 1 to 12; day <- 1 to 31){
     //Create date String
     m = month.toString()
     d = day.toString()
     if(month < 10)
       m = "0" + m
     if(day < 10)
       d = "0" + d
     date = year.toString() + m + d

     try{
       //one delta
       delta_df = spark.read.parquet(sourcePath + date + "*")

       //delete duplicate rows (I want to ignore them)
       doubleRows_df  = delta_df.groupBy("key").count().where("count > 1").select("key")
       delta_df = delta_df.join(doubleRows_df, Seq("key"), "leftanti")

       //deletes all (old) rows in final_df, that are in delta_df
       final_df = final_df.join(delta_df, Seq("key"), "leftanti")

       //add all new rows in delta
       final_df = final_df.union(delta_df)

       println(date)
     }catch{
       case e:org.apache.spark.sql.AnalysisException=>{}
     }
    if(day == end_day && month == end_month &&  year == end_year)
       break
   }
 }
 final_df.write.mode("overwrite").parquet(destinationPath)

The full stacktrace:

19/11/26 11:19:04 WARN util.Utils: Suppressing exception in finally: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
        at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
        at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
        at com.esotericsoftware.kryo.io.Output.flush(Output.java:181)
        at com.esotericsoftware.kryo.io.Output.close(Output.java:191)
        at org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:223)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
        at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
        at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
        at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
        at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
        at com.esotericsoftware.kryo.io.Output.flush(Output.java:181)
        at com.esotericsoftware.kryo.io.Output.require(Output.java:160)
        at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:246)
        at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:232)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:54)
        at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:43)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:276)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:276)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
        at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  1. distinct or dropDuplicates is not an option, since you can't control which values will be taken. It may very well happen that the new value is not added while the old value is preserved.
  2. You need to do a join over ID - see the types of joins here. The joined rows then contain either only the old value, only the new value, or both. When only the old or only the new one is present, you take the one that is there; when both are present, you take only the new one (see the sketch after this list).
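
A minimal sketch of that join-based merge, assuming the columns from the example (ID, Name, Street); applyDelta is just an illustrative helper name. For every ID the delta value wins whenever both sides are present, otherwise whichever side exists is kept:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

def applyDelta(current: DataFrame, delta: DataFrame): DataFrame =
  current.as("old")
    .join(delta.as("new"), Seq("ID"), "full_outer")                // keep IDs from both sides
    .select(
      col("ID"),
      coalesce(col("new.Name"), col("old.Name")).as("Name"),       // prefer the delta value
      coalesce(col("new.Street"), col("old.Street")).as("Street")
    )

Folding this over the deltas in date order (for example deltas.foldLeft(initial_df)(applyDelta)) gives essentially the same result as the leftanti/union loop in the update, but it is explicit about which value survives.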

An example from here of how to add multiple deltas at once.

Question: What are the best-selling and the second best-selling products in every category?

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
import spark.implicits._   // for toDF and the 'symbol column syntax

val dataset = Seq(
  ("Thin",       "cell phone", 6000),
  ("Normal",     "tablet",     1500),
  ("Mini",       "tablet",     5500),
  ("Ultra thin", "cell phone", 5000),
  ("Very thin",  "cell phone", 6000),
  ("Big",        "tablet",     2500),
  ("Bendable",   "cell phone", 3000),
  ("Foldable",   "cell phone", 3000),
  ("Pro",        "tablet",     4500),
  ("Pro2",       "tablet",     6500))
  .toDF("product", "category", "revenue")

val overCategory = Window.partitionBy('category).orderBy('revenue.desc)

val ranked = dataset.withColumn("rank", dense_rank.over(overCategory))

scala> ranked.show
+----------+----------+-------+----+
|   product|  category|revenue|rank|
+----------+----------+-------+----+
|      Pro2|    tablet|   6500|   1|
|      Mini|    tablet|   5500|   2|
|       Pro|    tablet|   4500|   3|
|       Big|    tablet|   2500|   4|
|    Normal|    tablet|   1500|   5|
|      Thin|cell phone|   6000|   1|
| Very thin|cell phone|   6000|   1|
|Ultra thin|cell phone|   5000|   2|
|  Bendable|cell phone|   3000|   3|
|  Foldable|cell phone|   3000|   3|
+----------+----------+-------+----+

scala> ranked.where('rank <= 2).show
+----------+----------+-------+----+
|   product|  category|revenue|rank|
+----------+----------+-------+----+
|      Pro2|    tablet|   6500|   1|
|      Mini|    tablet|   5500|   2|
|      Thin|cell phone|   6000|   1|
| Very thin|cell phone|   6000|   1|
|Ultra thin|cell phone|   5000|   2|
+----------+----------+-------+----+
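
Applied to the dated folders from the question, the same window idea might look roughly like the sketch below. The base path, the regexp that recovers the load date from the folder name via input_file_name, and the cut-off date are assumptions, and row_number is used instead of dense_rank so that exactly one row per ID survives:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract, row_number}

val basePath = "hdfs:/path/spezific_data"

// read the initial load and every delta at once, tagging each row with the
// yyyyMMdd folder it came from
val all = spark.read.parquet(basePath + "/*")
  .withColumn("load_date", regexp_extract(input_file_name(), "/(\\d{8})/", 1))

// keep only the newest row per ID, optionally only up to a given date
val newestPerId = Window.partitionBy("ID").orderBy(col("load_date").desc)

val latest = all
  .where(col("load_date") <= "20180103")            // optional: state as of 20180103
  .withColumn("rn", row_number().over(newestPerId))
  .where(col("rn") === 1)
  .drop("rn", "load_date")

This reads all deltas in one pass instead of re-reading and re-joining the result once per folder.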

UPDATE 1:

First of all, consider using date utilities instead of manually iterating over numbers to get the dates:

Date dt = new Date();
// an Instant has no time zone, so convert via ofInstant rather than LocalDateTime.from
LocalDateTime next = LocalDateTime.ofInstant(dt.toInstant(), ZoneId.systemDefault()).plusDays(1);

See this for more details.
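
For the Scala loop in the question, a minimal sketch of the same idea with java.time (the start and end dates are taken from the variables in the question; only the date generation changes, the per-day read can stay as it is):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val fmt   = DateTimeFormatter.ofPattern("yyyyMMdd")
val start = LocalDate.parse("20170427", fmt)
val end   = LocalDate.of(2019, 10, 31)

// walk day by day from start to end; this also skips the invalid dates
// (e.g. February 30) that the nested year/month/day counters produce
Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(_.format(fmt))
  .foreach { date =>
    // spark.read.parquet(sourcePath + date + "*") goes here, wrapped in the
    // same try/catch as before for days that have no folder
    println(date)
  }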

Second - please post the full stacktrace, not just the StackOverflowError.
