
How to get row_number in a pyspark dataframe

In order to rank, I need to get the row_number of a pyspark dataframe. I saw that there is a row_number function in the window functions of pyspark, but that requires using a HiveContext.

I tried to replace the sqlContext with a HiveContext:

        import pyspark
        self.sc = pyspark.SparkContext()
        #self.sqlContext = pyspark.sql.SQLContext(self.sc)
        self.sqlContext = pyspark.sql.HiveContext(self.sc)

But it now throws the exception TypeError: 'JavaPackage' object is not callable. Can you help me either get the HiveContext working or get the row number in a different way?

Example of data: I want to first rank by my prediction and then calculate a loss function (NDCG) based on this ranking. In order to calculate the loss function I will need the ranking (i.e. the position of the prediction in the sorted order).

So the first step is to sort the data by pred, but then I need a running counter over the sorted data.

+-----+--------------------+
|label|                pred|
+-----+--------------------+
|  1.0|[0.25313606997906...|
|  0.0|[0.40893413256608...|
|  0.0|[0.18353492079000...|
|  0.0|[0.77719741215204...|
|  1.0|[0.62766290642569...|
|  1.0|[0.40893413256608...|
|  1.0|[0.63084085591913...|
|  0.0|[0.77719741215204...|
|  1.0|[0.36752166787523...|
|  0.0|[0.40893413256608...|
|  1.0|[0.25528507573737...|
|  1.0|[0.25313606997906...|
+-----+--------------------+

Thanks.

You don't need to create a HiveContext if your data is not in Hive. You can just carry on with your sqlContext.

There is no row_number for your dataframe unless you create one. pyspark.sql.functions.row_number is for a different purpose and it only works with a windowed partition.

What you may need is to create a new column as the row_id using monotonically_increasing_id, then query it later.

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import Row

data = sc.parallelize([
  Row(key=1, val='a'),
  Row(key=2, val='b'), 
  Row(key=3, val='c'), 
]).toDF()

# Add a monotonically increasing (but not necessarily consecutive) id column.
data = data.withColumn(
  'row_id',
  monotonically_increasing_id()
)

data.collect()


Out[8]: 
[Row(key=1, val=u'a', row_id=17179869184),
 Row(key=2, val=u'b', row_id=42949672960),
 Row(key=3, val=u'c', row_id=60129542144)]
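
If what you actually need is the position of each row after sorting by pred (rather than an arbitrary unique id), a row_number over a window ordered by pred gives that directly, and on Spark 2.x and later a plain SparkSession is enough for window functions, so no HiveContext is required. Below is a minimal sketch; the scores dataframe and its values are made up for illustration (pred is a plain float here for simplicity; with a probability vector you would order by the relevant element), and only the column names label and pred come from your example.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Made-up toy rows standing in for the (label, pred) data shown above.
scores = spark.createDataFrame(
    [(1.0, 0.91), (0.0, 0.35), (1.0, 0.78), (0.0, 0.62)],
    ['label', 'pred']
)

# row_number() only works over a window; ordering an unpartitioned window
# by pred (descending) ranks the whole dataframe. Spark moves all rows into
# a single partition to do this, so it is only cheap for small data.
w = Window.orderBy(F.desc('pred'))
ranked = scores.withColumn('rank', F.row_number().over(w))

ranked.show()

The resulting rank column is the 1-based position of each row in the sorted order, which you could then feed into the NDCG computation.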
