
Apply window function in Spark with non-constant frame size

My Problem

I am currently facing difficulties with Spark window functions. I am using Spark (through pyspark) version 1.6.3 (associated Python version 2.6.6). I run a pyspark shell instance that automatically initializes HiveContext as my sqlContext.

I want to do a rolling sum with a window function. My problem is that the window frame is not fixed: it depends on the observation we consider. To be more specific, I order the data by a variable called rank_id and want to do a rolling sum, for any observation indexed $x$, between indexes $x+1$ and $2x-1$. Thus, my rangeBetween must depend on the rank_id variable value.
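For example, for the observation with rank_id $x = 3$, the frame runs from index $4$ to index $5$, so the desired rolling sum is $y_4 + y_5$.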

An important point is that I don't want to collect data and thus cannot use anything like numpy (my data has many, many observations).

Reproducible example

from pyspark.mllib.random import RandomRDDs
import pyspark.sql.functions as psf
from pyspark.sql.window import Window

# Reproducible example
data = RandomRDDs.uniformVectorRDD(sc, 15, 2)
df = data.map(lambda l: (float(l[0]), float(l[1]))).toDF()
df = df.selectExpr("_1 as x", "_2 as y")

#df.show(2)
#+-------------------+------------------+                                        
#|                  x|                 y|
#+-------------------+------------------+
#|0.32767742062486405|0.2506351566289311|
#| 0.7245348534550357| 0.597929853274274|
#+-------------------+------------------+
#only showing top 2 rows

# Finalize dataframe creation
w = Window().orderBy("x")
df = df.withColumn("rank_id", psf.rowNumber().over(w)).sort("rank_id")
#df.show(3)
#+--------------------+--------------------+-------+                             
#|                   x|                   y|rank_id|
#+--------------------+--------------------+-------+
#|0.016536160706045577|0.009892450530381458|      1|
#| 0.10943843181953838|  0.6478505849227775|      2|
#| 0.13916818312857027| 0.24165348228464578|      3|
#+--------------------+--------------------+-------+
#only showing top 3 rows

Fixed width cumulative sum: no problem

Using a window function, I am able to run a cumulative sum over a given number of indexes (I use rangeBetween here, but for this example rowsBetween could be used interchangeably).

w = Window.orderBy('rank_id').rangeBetween(-1,3)
df1 = df.select('*', psf.sum(df['y']).over(w).alias('roll1'))
#df1.show(3)
#+--------------------+--------------------+-------+------------------+          
#|                   x|                   y|rank_id|             roll1|
#+--------------------+--------------------+-------+------------------+
#|0.016536160706045577|0.009892450530381458|      1|0.9698521852602887|
#| 0.10943843181953838|  0.6478505849227775|      2|1.5744700156326066|
#| 0.13916818312857027| 0.24165348228464578|      3|2.3040547273760392|
#+--------------------+--------------------+-------+------------------+
#only showing top 3 rows

Cumulative sum width not fixed

I want to sum between indexes x+1 and 2x-1 where x is my row index. When I try to pass this to Spark (in a similar way to what we do for orderBy, maybe that's the problem), I get the following error:

# Now if I want to make rangeBetween size depend on a variable
w = Window.orderBy('rank_id').rangeBetween('rank_id'+1,2*'rank_id'-1)

Traceback (most recent call last): File "", line 1, in TypeError: cannot concatenate 'str' and 'int' objects 追溯(最近一次调用最近):TypeError中的文件“”,第1行,无法连接“ str”和“ int”对象

I tried something else, using a SQL statement:

# Using SQL expression
df.registerTempTable('tempdf')
df2 = sqlContext.sql("""
   SELECT *, SUM(y)
   OVER (ORDER BY rank_id
   RANGE BETWEEN rank_id+1 AND 2*rank_id-1) AS cumsum
   FROM tempdf;
""")

which this time gives me the following error:

Traceback (most recent call last): File "", line 6, in File "/opt/application/Spark/current/python/pyspark/sql/context.py", line >580, in sql return DataFrame(self._ssql_ctx.sql(sqlQuery), self) File "/opt/application/Spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in call File "/opt/application/Spark/current/python/pyspark/sql/utils.py", line 51, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u"cannot recognize input near 'rank_id' '+' '1' in windowframeboundary; line 3 pos 15" 追溯(最近一次调用最近):文件“”,行6,在文件“ /opt/application/Spark/current/python/pyspark/sql/context.py”中,行> 580,在sql中返回DataFrame(self._ssql_ctx .sql(sqlQuery),self)文件“ /opt/application/Spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py”,行813,在调用文件“ / opt / application /装饰中引发Spark / current / python / pyspark / sql / utils.py“,第51行,引发AnalysisException(s.split(':',1)[1],stackTrace)pyspark.sql.utils.AnalysisException:u”无法在windowframeboundary中识别'rank_id''+''1'附近的输入;第3行pos 15“

I also noticed that when I try a simpler statement using the SQL OVER clause, I get a similar error, which maybe means I am not passing the SQL statement correctly to Spark:

df2 = sqlContext.sql("""
   SELECT *, SUM(y)
   OVER (ORDER BY rank_id
   RANGE BETWEEN -1 AND 1) AS cumsum
   FROM tempdf;
 """)

Traceback (most recent call last): File "", line 6, in File "/opt/application/Spark/current/python/pyspark/sql/context.py", line 580, in sql return DataFrame(self._ssql_ctx.sql(sqlQuery), self) File "/opt/application/Spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in call File "/opt/application/Spark/current/python/pyspark/sql/utils.py", line 51, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u"cannot recognize input near '-' '1' 'AND' in windowframeboundary; line 3 pos 15" 追溯(最近一次通话最近):文件“ /opt/application/Spark/current/python/pyspark/sql/context.py”中的文件580行,在SQL中的行580,返回DataFrame(self._ssql_ctx。 sql(sqlQuery),self)文件“ /opt/application/Spark/current/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py”,行813,在调用文件“ / opt / application / Spark”中/current/python/pyspark/sql/utils.py“,第51行,在装饰中引发AnalysisException(s.split(':',1)[1],stackTrace)pyspark.sql.utils.AnalysisException:u”无法识别在windowframeboundary中的'-''1''AND'附近输入;第3行pos 15“

How could I solve my problem by using either a window function or a SQL statement within Spark?

How could I solve my problem by using either a window function or a SQL statement within Spark?

TL;DR You cannot, or at least not in a scalable way, with the current requirements. You can try something similar to sliding over an RDD: How to transform data with sliding window over time series data in Pyspark

I also noticed that when I try a simpler statement using the SQL OVER clause, I get a similar error, which maybe means I am not passing the SQL statement correctly to Spark

This is incorrect. The range specification requires a ( PRECEDING | FOLLOWING | CURRENT ROW ) boundary. Also, there should be no semicolon:

SELECT *, SUM(x)
OVER (ORDER BY rank_id
RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS cumsum
FROM tempdf
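For completeness, a minimal sketch (not part of the original answer) of running this corrected query from the pyspark shell, reusing the tempdf table registered in the question and the sqlContext (HiveContext) created by the shell:

# Frame boundaries use PRECEDING/FOLLOWING and there is no trailing semicolon
df2 = sqlContext.sql("""
   SELECT *, SUM(x)
   OVER (ORDER BY rank_id
   RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS cumsum
   FROM tempdf
""")
df2.show(3)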

I want to sum between indexes x+1 and 2x-1 where x is my row index. When I try to pass this to Spark (in a similar way to what we do for orderBy, maybe that's the problem), I get the following error ...

TypeError: cannot concatenate 'str' and 'int' objects

As the exception says, you cannot call + on a string and an integer. You probably wanted columns:

from pyspark.sql.functions import col

.rangeBetween(col('rank_id') + 1,  2* col('rank_id') - 1)

but this is not supported. The range has to be of a fixed size and cannot be defined in terms of expressions.
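As a side note, here is a minimal sketch (an assumption about typical Spark 1.6 usage, not part of the original answer) of the frame boundaries that rangeBetween does accept: plain integer constants, with -sys.maxsize conventionally standing in for "unbounded preceding", since the Window.unboundedPreceding constant only appears in later Spark versions.

import sys
from pyspark.sql.window import Window

# Accepted: constant integer offsets around the current row's ordering value
w_fixed = Window.orderBy('rank_id').rangeBetween(-1, 3)

# Accepted: an effectively unbounded running frame
w_running = Window.orderBy('rank_id').rangeBetween(-sys.maxsize, 0)

# Not accepted: boundaries built from column expressions, as explained above
# Window.orderBy('rank_id').rangeBetween(col('rank_id') + 1, 2 * col('rank_id') - 1)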

An important point is that I don't want to collect data

A window definition without partitionBy:

w = Window.orderBy('rank_id').rangeBetween(-1,3)

is as bad as collect: it moves the whole dataset into a single partition. So even if there are workarounds for the "dynamic frame" problem (with conditionals and an unbounded window), they won't help you here.
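To make that limitation concrete, here is a hedged sketch (not from the original answer) of one such workaround: a non-equi self-join that reproduces the dynamic frame, reusing df and the psf import from the reproducible example above. It works on small data, but the join condition is essentially a cartesian product, which illustrates why these workarounds do not scale.

# Right-hand copy of the frame columns, renamed to avoid ambiguity
right = df.select(psf.col('rank_id').alias('rank_id_r'),
                  psf.col('y').alias('y_r'))

# For each row, keep the rows whose rank_id falls in [rank_id + 1, 2 * rank_id - 1];
# left_outer keeps rows whose frame is empty (their sum comes back as null)
rolled = (df.join(right,
                  (psf.col('rank_id_r') >= psf.col('rank_id') + 1) &
                  (psf.col('rank_id_r') <= 2 * psf.col('rank_id') - 1),
                  'left_outer')
            .groupBy('rank_id', 'x', 'y')
            .agg(psf.sum('y_r').alias('roll_dynamic'))
            .orderBy('rank_id'))
#rolled.show(3)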
