How to convert PARTITION BY and ORDER BY with ROW_NUMBER to PySpark?
I have a code snippet here from the project I am working on. I don't understand what exactly the function of ROW_NUMBER() is, or what the rownumber outside the parentheses does.
select az.*, ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber
I am supposed to convert this code to PySpark, and I need some help with the conversion.
ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber
means "break the results into groups where all rows in each group have the same value for txn_no/seq_no, then number them sequentially, increasing in order of txn_no/seq_no" (which doesn't make sense; the person who wrote this might not have known what they were doing). The final rownumber is just an alias for this generated column.
All in all, for a resultset like:
txn_no, seq_no
123, 123
123, 123
123, 124
124, 124
124, 124
124, 124
124, 124
It would end up like:
txn_no, seq_no, rownumber
123, 123, 1
123, 123, 2
123, 124, 1
124, 124, 1
124, 124, 2
124, 124, 3
124, 124, 4
i.e. the rownumber restarts every time the txn_no/seq_no pairing changes.
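You can sanity-check this restart-per-group behaviour in plain Python, without Spark; this is just an emulation of the window semantics, not the PySpark API:

```python
from itertools import groupby

# Plain-Python emulation of
#   ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no)
# on the sample rows above.
rows = [
    (123, 123),
    (123, 123),
    (123, 124),
    (124, 124),
    (124, 124),
    (124, 124),
    (124, 124),
]

numbered = []
# Sorting places equal (txn_no, seq_no) pairs together; groupby then walks
# each partition, and enumerate restarts the counter at 1 per partition.
for _, group in groupby(sorted(rows)):
    for rn, row in enumerate(group, start=1):
        numbered.append((*row, rn))

for row in numbered:
    print(row)
```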
As noted, it doesn't make much sense. This is a more valid use of ROW_NUMBER:
person, address, livedthereuntil
john, palm springs, 2019-12-31
john, capitol hill, 2020-12-31
john, sunset strip, 2018-12-31
ROW_NUMBER() OVER(PARTITION BY person ORDER BY livedthereuntil DESC) rn
person, address, livedthereuntil, rn
john, capitol hill, 2020-12-31, 1
john, palm springs, 2019-12-31, 2
john, sunset strip, 2018-12-31, 3
Thus we can say John's most recent address is rn 1, or things like "he lived in sunset strip (rn=3) before he lived in his previous house" as an answer to "where was he living 2 houses ago?"
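The John example can likewise be checked in plain Python; again this only emulates the window semantics (descending order within a partition), it is not the PySpark API:

```python
# Emulate ROW_NUMBER() OVER (PARTITION BY person ORDER BY livedthereuntil DESC)
# for the small address history above.
rows = [
    ("john", "palm springs", "2019-12-31"),
    ("john", "capitol hill", "2020-12-31"),
    ("john", "sunset strip", "2018-12-31"),
]

# Newest date first, then a stable sort by person to form the partitions
# (Python's sort is stable, so date order is preserved within each person).
ordered = sorted(rows, key=lambda r: r[2], reverse=True)
ordered = sorted(ordered, key=lambda r: r[0])

ranked = []
counts = {}
for person, address, until in ordered:
    counts[person] = counts.get(person, 0) + 1  # row number within the partition
    ranked.append((person, address, until, counts[person]))

for row in ranked:
    print(row)
```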
Typically, for this reason, we partition by one thing but order by another before we start numbering. Stuff like "Get me the 3 most recent transactions for every person who has an account with us" -> over(partition by account number order by transaction_date desc) as rn ... where rn < 4. It doesn't make sense to order by the same thing you partition by, because logically everything in the partition has the same value, so ordering it is useless.
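The "3 most recent per group" pattern can also be emulated in plain Python; the account and transaction data below are made up purely for illustration:

```python
# Emulate ROW_NUMBER() OVER (PARTITION BY account_no ORDER BY txn_date DESC) AS rn
# followed by WHERE rn < 4 ("3 most recent transactions per account").
# The data and column names are hypothetical.
txns = [
    ("acct1", "2021-01-05"),
    ("acct1", "2021-03-01"),
    ("acct1", "2021-02-14"),
    ("acct1", "2021-04-20"),
    ("acct2", "2021-01-30"),
]

ordered = sorted(txns, key=lambda t: t[1], reverse=True)  # newest first
ordered = sorted(ordered, key=lambda t: t[0])             # stable: partition by account

counts = {}
top3 = []
for acct, date in ordered:
    counts[acct] = counts.get(acct, 0) + 1
    if counts[acct] < 4:  # rn < 4
        top3.append((acct, date))

for row in top3:
    print(row)
```

Note how the oldest acct1 transaction (2021-01-05) is dropped because its row number within the account partition would be 4.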
I don't know PySpark, but I hope this helps you replicate whatever that row number was for.
The PySpark equivalent of the Oracle SQL code written above is as follows:
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

t3 = az.select(
    az["*"],
    sf.row_number()
      .over(Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no"))
      .alias("rownumber"),
)
Now, as said above, the ORDER BY here seems unwanted, since it repeats the same columns as the PARTITION BY, which indeed results in continuously changing row numbers, but that's what my project head is asking for. Changes to the answer above are highly welcome! Thank you!