
How to convert PARTITION BY and ORDER BY with ROW_NUMBER to PySpark?

I have a code snippet here from the project I am working on. I don't understand what exactly the ROW_NUMBER() function does, or what the rownumber outside the parentheses is.

select az.*, ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber

I am supposed to convert this code to PySpark. I need some help with the conversion.

ROW_NUMBER() OVER (PARTITION BY txn_no, seq_no ORDER BY txn_no, seq_no) rownumber means "break the results into groups where all rows in each group have the same values for txn_no/seq_no, then number them sequentially, increasing in order of txn_no/seq_no" (which doesn't make sense, since those are also the partitioning columns; the person who wrote this might not have known what they were doing). The final rownumber is just an alias for this generated column.

All in all, for a result set like:

txn_no, seq_no
123,    123
123,    123
123,    124
124,    124
124,    124

It would end up like:

txn_no, seq_no, rownumber
123,    123,    1
123,    123,    2
123,    124,    1
124,    124,    1
124,    124,    2

i.e. the rownumber restarts every time the txn_no/seq_no pairing changes.
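
For reference, this behaviour can be reproduced in PySpark. The following is a minimal sketch, assuming an available SparkSession named spark and the sample data shown above:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the result set above
df = spark.createDataFrame(
    [(123, 123), (123, 123), (123, 124), (124, 124), (124, 124)],
    ["txn_no", "seq_no"],
)

# Same window as the SQL: partition and order by the same columns
w = Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no")
df.withColumn("rownumber", sf.row_number().over(w)).show()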

As noted, it doesn't make much sense. This is a more valid use of ROW_NUMBER:

person, address,      livedthereuntil
john,   palm springs, 2019-12-31
john,   capitol hill, 2020-12-31
john,   sunset strip, 2018-12-31

ROW_NUMBER() OVER (PARTITION BY person ORDER BY livedthereuntil DESC) rn

person, address,      livedthereuntil, rn
john,   capitol hill, 2020-12-31,      1
john,   palm springs, 2019-12-31,      2
john,   sunset strip, 2018-12-31,      3

Thus we can say John's most recent address is the row with rn 1, or answer questions like "where was he living two houses ago?" by noting that he lived on sunset strip (rn=3) before his previous house.
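
In PySpark, a minimal sketch of this address example (again assuming a SparkSession named spark; the data is the sample above) could look like:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

addresses = spark.createDataFrame(
    [
        ("john", "palm springs", "2019-12-31"),
        ("john", "capitol hill", "2020-12-31"),
        ("john", "sunset strip", "2018-12-31"),
    ],
    ["person", "address", "livedthereuntil"],
)

# Partition by person, order with the most recent date first
w = Window.partitionBy("person").orderBy(sf.col("livedthereuntil").desc())
addresses.withColumn("rn", sf.row_number().over(w)).show()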

Typically we partition by one thing but order by another before we start numbering, for this reason. Think of requirements like "get me the 3 most recent transactions for every person who has an account with us": OVER (PARTITION BY account_number ORDER BY transaction_date DESC) AS rn ... WHERE rn < 4. It doesn't make sense to order by the same columns you partition by, because logically everything in the partition has the same values for them, so ordering by them is useless.
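
A sketch of that top-N pattern in PySpark; the transactions DataFrame, its column names, and the toy data are illustrative assumptions, not from the original code:

from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy data; account_number/transaction_date are assumed names
transactions = spark.createDataFrame(
    [
        ("A1", "2021-01-05"), ("A1", "2021-01-06"),
        ("A1", "2021-01-07"), ("A1", "2021-01-08"),
        ("A2", "2021-01-02"),
    ],
    ["account_number", "transaction_date"],
)

w = Window.partitionBy("account_number").orderBy(sf.col("transaction_date").desc())
(
    transactions
    .withColumn("rn", sf.row_number().over(w))
    .filter(sf.col("rn") < 4)  # keep the 3 most recent per account
    .show()
)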

I don't know pyspark, but I hope this helps you replicate whatever that row number was for.

The PySpark equivalent of the Oracle SQL code written above is as follows:

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

t3 = az.select(
    az["*"],
    sf.row_number()
    .over(Window.partitionBy("txn_no", "seq_no").orderBy("txn_no", "seq_no"))
    .alias("rownumber"),
)

Now, as said above, the ORDER BY here seems unwanted, as it repeats the same columns used for partitioning, which does mean the row numbers can keep changing between runs, but that is what my project head is asking for. Changes to the answer above are highly welcome. Thank you!
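
If deterministic numbering ever becomes a requirement, one possible variant (an assumption on my part, not part of the original requirement) is to order the window by a generated id instead of the partition columns. Note that monotonically_increasing_id is increasing but not consecutive, and reflects the physical layout of the data rather than any business order:

from pyspark.sql import functions as sf
from pyspark.sql.window import Window

# Variant: add a generated id, order the window by it, then drop it
az_with_id = az.withColumn("_row_id", sf.monotonically_increasing_id())
w = Window.partitionBy("txn_no", "seq_no").orderBy("_row_id")
t3 = az_with_id.select(
    az_with_id["*"],
    sf.row_number().over(w).alias("rownumber"),
).drop("_row_id")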
