
How to get row_number in a pyspark dataframe

In order to rank, I need to get the row_number of a pyspark dataframe. I saw that there is a row_number function in the window functions of pyspark, but that requires using a HiveContext.

I tried to replace the sqlContext with a HiveContext:

        import pyspark
        self.sc = pyspark.SparkContext()
        #self.sqlContext = pyspark.sql.SQLContext(self.sc)
        self.sqlContext = pyspark.sql.HiveContext(self.sc)

But it now throws the exception TypeError: 'JavaPackage' object is not callable. Can you help me either get the HiveContext working or get the row number in a different way?

Example of data: I want to first rank by my prediction and then calculate a loss function (NDCG) based on this ranking. In order to calculate the loss function I will need the ranking (i.e. the position of the prediction in the sorted order).

So the first step is to sort the data by pred, but then I need a running counter over the sorted data.

+-----+--------------------+
|label|                pred|
+-----+--------------------+
|  1.0|[0.25313606997906...|
|  0.0|[0.40893413256608...|
|  0.0|[0.18353492079000...|
|  0.0|[0.77719741215204...|
|  1.0|[0.62766290642569...|
|  1.0|[0.40893413256608...|
|  1.0|[0.63084085591913...|
|  0.0|[0.77719741215204...|
|  1.0|[0.36752166787523...|
|  0.0|[0.40893413256608...|
|  1.0|[0.25528507573737...|
|  1.0|[0.25313606997906...|
+-----+--------------------+

Thanks.

You don't need to create a HiveContext if your data is not in Hive. You can just carry on with your sqlContext.

There is no row_number for your dataframe unless you create one. pyspark.sql.functions.row_number is for a different purpose and it only works with a windowed partition.

What you may need is to create a new column as the row_id using monotonically_increasing_id, then query it later.

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import Row

data = sc.parallelize([
  Row(key=1, val='a'),
  Row(key=2, val='b'), 
  Row(key=3, val='c'), 
]).toDF()

# Add a monotonically increasing (but not necessarily consecutive) id column.
data = data.withColumn(
  'row_id',
  monotonically_increasing_id()
)

data.collect()


Out[8]: 
[Row(key=1, val=u'a', row_id=17179869184),
 Row(key=2, val=u'b', row_id=42949672960),
 Row(key=3, val=u'c', row_id=60129542144)]
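
If what you actually need is the position of each row after sorting by pred (rather than an arbitrary unique id), a row_number over a window ordered by pred gives that directly, and on Spark 2.x and later a plain SparkSession is enough for window functions, so no HiveContext is required. Below is a minimal sketch; the scores dataframe and its values are made up for illustration (pred is a plain float here for simplicity; with a probability vector you would order by the relevant element), and only the column names label and pred come from your example.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Made-up toy rows standing in for the (label, pred) data shown above.
scores = spark.createDataFrame(
    [(1.0, 0.91), (0.0, 0.35), (1.0, 0.78), (0.0, 0.62)],
    ['label', 'pred']
)

# row_number() only works over a window; ordering an unpartitioned window
# by pred (descending) ranks the whole dataframe. Spark moves all rows into
# a single partition to do this, so it is only cheap for small data.
w = Window.orderBy(F.desc('pred'))
ranked = scores.withColumn('rank', F.row_number().over(w))

ranked.show()

The resulting rank column is the 1-based position of each row in the sorted order, which you could then feed into the NDCG computation.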
