
spark dataframe how to get the latest n rows using java

I am new to Spark. Right now I am trying to join two DataFrames together. I want to cap my dataframe at 5000 rows. Since my first dataframe already has 5000 rows and my second dataframe has 1000 rows, I need to keep only the latest 4000 rows of the first one. Can someone help me on how to get a dataframe with the latest 4000 rows of the first dataframe? Thanks in advance.

I'm not sure what you're really hoping to achieve this way, but if you're on Spark 1.5 you could do something like this using monotonicallyIncreasingId:

import org.apache.spark.sql.functions.monotonicallyIncreasingId

// Sort by the generated ID in descending order, then keep the first 4000 rows
val df4000 = df.sort(monotonicallyIncreasingId().desc).limit(4000)

which will sort the dataframe in descending order by the ID generated for each row, then limit the results to the first 4000.

Otherwise you could do the same using any column that you know increases consistently.
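Since the question asks for Java specifically, here is a rough sketch of the same idea in the Java API. Note this is an untested sketch: it assumes a newer Spark version (2.x+), where the function is spelled monotonically_increasing_id and DataFrames are Dataset&lt;Row&gt;; on Spark 1.5 the function is monotonicallyIncreasingId() and the type is DataFrame.

```java
import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class LatestRows {
    // Return the latest n rows of df, using the generated increasing ID
    // as a proxy for insertion order (same approach as the Scala snippet).
    static Dataset<Row> latestN(Dataset<Row> df, int n) {
        return df.orderBy(monotonically_increasing_id().desc()).limit(n);
    }
}
```

As with the Scala version, the ID only reflects the current partition layout, so this is reliable only if you know the dataframe's order corresponds to how "new" the rows are.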

