
How to convert PARTITION BY and ORDER BY with ROW_NUMBER to PySpark?

I have a code snippet here from the project I am working on. I don't understand what exactly the ROW_NUMBER() function does, or what the rownumber outside the parentheses is.

select az.*, ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber

I am supposed to convert this code to PySpark. I need some help with the conversion.

ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber means "break the results into groups where all rows in each group have the same values for txn_no/seq_no, then number them sequentially, increasing in order of txn_no/seq_no" (which doesn't make sense, since those are also the partitioning columns; the person who wrote this might not have known what they were doing). The final rownumber is just an alias for this generated column.

All in all, for a result set like:

txn_no, seq_no
123,    123
123,    123
123,    124
124,    124
124,    124

It would end up like:

txn_no, seq_no, rownumber
123,    123,    1
123,    123,    2
123,    124,    1
124,    124,    1
124,    124,    2

i.e. the rownumber restarts every time the txn_no/seq_no pairing changes.
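
For reference, this behaviour can be reproduced in PySpark. The following is a minimal sketch, assuming an available SparkSession named spark and the sample data shown above:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the result set above
df = spark.createDataFrame(
    [(123, 123), (123, 123), (123, 124), (124, 124), (124, 124)],
    ["txn_no", "seq_no"],
)

# Same window as the SQL: partition and order by the same columns
w = Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")
df.withColumn("rownumber", sf.row_number().over(w)).show()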

As noted, it doesn't make much sense. This is a more valid use of ROW_NUMBER:

person, address,      livedthereuntil
john,   palm springs, 2019-12-31
john,   capitol hill, 2020-12-31
john,   sunset strip, 2018-12-31

ROW_NUMBER() OVER (PARTITION BY person ORDER BY livedthereuntil DESC) rn

person, address,      livedthereuntil, rn
john,   capitol hill, 2020-12-31,      1
john,   palm springs, 2019-12-31,      2
john,   sunset strip, 2018-12-31,      3

Thus we can say John's most recent address is the row with rn 1, or answer questions like "where was he living two houses ago?" by noting that he lived on sunset strip (rn=3) before his previous house.
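
In PySpark, a minimal sketch of this address example (again assuming a SparkSession named spark; the data is the sample above) could look like:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

addresses = spark.createDataFrame(
    [
        ("john", "palm springs", "2019-12-31"),
        ("john", "capitol hill", "2020-12-31"),
        ("john", "sunset strip", "2018-12-31"),
    ],
    ["person", "address", "livedthereuntil"],
)

# Partition by person, order with the most recent date first
w = Window.partitionBy("person").orderBy(sf.col("livedthereuntil").desc())
addresses.withColumn("rn", sf.row_number().over(w)).show()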

Typically we partition by one thing but order by another before we start numbering, for this reason. Think of requirements like "get me the 3 most recent transactions for every person who has an account with us": OVER (PARTITION BY account_number ORDER BY transaction_date DESC) AS rn ... WHERE rn < 4. It doesn't make sense to order by the same columns you partition by, because logically everything in the partition has the same values for them, so ordering by them is useless.
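
A sketch of that top-N pattern in PySpark; the transactions DataFrame, its column names, and the toy data are illustrative assumptions, not from the original code:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data; account_number/transaction_date are assumed names
transactions = spark.createDataFrame(
    [
        ("A1", "2021-01-05"), ("A1", "2021-01-06"),
        ("A1", "2021-01-07"), ("A1", "2021-01-08"),
        ("A2", "2021-01-02"),
    ],
    ["account_number", "transaction_date"],
)

w = Window.partitionBy("account_number").orderBy(sf.col("transaction_date").desc())
(
    transactions
    .withColumn("rn", sf.row_number().over(w))
    .filter(sf.col("rn") < 4)  # keep the 3 most recent per account
    .show()
)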

I don't know pyspark, but I hope this helps you replicate whatever that row number was for.

The PySpark equivalent of the Oracle SQL code written above is as follows:

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

t3 = az.select(
    az["*"],
    sf.row_number()
    .over(Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no"))
    .alias("rownumber"),
)

Now, as said above, the ORDER BY here seems unwanted, as it repeats the same columns used for partitioning, which does mean the row numbers can keep changing between runs, but that is what my project head is asking for. Changes to the answer above are highly welcome. Thank you!
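
If deterministic numbering ever becomes a requirement, one possible variant (an assumption on my part, not part of the original requirement) is to order the window by a generated id instead of the partition columns. Note that monotonically_increasing_id is increasing but not consecutive, and reflects the physical layout of the data rather than any business order:

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

# Variant: add a generated id, order the window by it, then drop it
az_with_id = az.withColumn("_row_id", sf.monotonically_increasing_id())
w = Window.partitionBy("txn_no", "seq_no").orderBy("_row_id")
t3 = az_with_id.select(
    az_with_id["*"],
    sf.row_number().over(w).alias("rownumber"),
).drop("_row_id")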
