How to get row_number in a pyspark dataframe
In order to rank, I need to get the row_number in a pyspark dataframe. I saw that there is a row_number function among the window functions of pyspark, but that requires using HiveContext.

I tried to replace the sqlContext with HiveContext:
import pyspark
self.sc = pyspark.SparkContext()
#self.sqlContext = pyspark.sql.SQLContext(self.sc)
self.sqlContext = pyspark.sql.HiveContext(self.sc)
But it now throws the exception TypeError: 'JavaPackage' object is not callable. Can you help me either get the HiveContext working, or get the row number in a different way?
Example of data: I want to first rank by my prediction and then calculate a loss function (NDCG) based on this ranking. In order to calculate the loss function I will need the ranking (i.e. the position of each prediction in the sorted order).

So the first step is to sort the data by pred, but then I need a running counter over the sorted data.
+-----+--------------------+
|label|                pred|
+-----+--------------------+
| 1.0|[0.25313606997906...|
| 0.0|[0.40893413256608...|
| 0.0|[0.18353492079000...|
| 0.0|[0.77719741215204...|
| 1.0|[0.62766290642569...|
| 1.0|[0.40893413256608...|
| 1.0|[0.63084085591913...|
| 0.0|[0.77719741215204...|
| 1.0|[0.36752166787523...|
| 0.0|[0.40893413256608...|
| 1.0|[0.25528507573737...|
| 1.0|[0.25313606997906...|
Thanks.
You don't need to create a HiveContext if your data is not in Hive. You can just carry on with your sqlContext.
There is no row_number for your dataframe unless you create one. pyspark.sql.functions.row_number is for a different purpose, and it only works with a windowed partition.
What you may need instead is to create a new column as a row_id using monotonically_increasing_id, and then query it later.
from pyspark.sql import Row
from pyspark.sql.functions import monotonically_increasing_id

data = sc.parallelize([
    Row(key=1, val='a'),
    Row(key=2, val='b'),
    Row(key=3, val='c'),
]).toDF()

# Attach a unique, increasing id to each row.
data = data.withColumn('row_id', monotonically_increasing_id())
data.collect()
Out[8]:
[Row(key=1, val=u'a', row_id=17179869184),
Row(key=2, val=u'b', row_id=42949672960),
Row(key=3, val=u'c', row_id=60129542144)]