
How to calculate cumulative sum using sqlContext

I know we can use the Window function in pyspark to calculate a cumulative sum, but Window is only supported in HiveContext and not in SQLContext. I need to use SQLContext, as HiveContext cannot be run in multiple processes.

Is there any efficient way to calculate a cumulative sum using SQLContext? A simple way is to load the data into the driver's memory and use numpy.cumsum, but the downside is that the data needs to fit into memory.
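
For reference, a minimal sketch of that collect-then-cumsum fallback, assuming a DataFrame df with a numeric revenue column (the names here are illustrative); it only works when the column fits in driver memory:

import numpy as np

# pull the column into driver memory -- only viable when the data fits there
revenues = [row.revenue for row in df.select("revenue").orderBy("revenue").collect()]
cumulative = np.cumsum(revenues)  # numpy array of running totals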

Not sure if this is what you are looking for, but here are two examples of how to use sqlContext to calculate the cumulative sum:

First, when you want to partition it by some categories:

from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext

# assumes an existing SparkContext (sc) and SQLContext (sqlContext), e.g. from the pyspark shell

rdd = sc.parallelize([
    ("Tablet", 6500), 
    ("Tablet", 5500), 
    ("Cell Phone", 6000), 
    ("Cell Phone", 6500), 
    ("Cell Phone", 5500)
    ])

schema = StructType([
    StructField("category", StringType(), False),
    StructField("revenue", LongType(), False)
    ])

df = sqlContext.createDataFrame(rdd, schema)

df.registerTempTable("test_table")

df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (PARTITION BY category ORDER BY revenue) as cumsum
FROM
test_table
""")

Output:

[Row(category='Tablet', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=6500, cumsum=12000),
 Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Cell Phone', revenue=6000, cumsum=11500),
 Row(category='Cell Phone', revenue=6500, cumsum=18000)]

Second, when you only want the cumulative sum of a single variable over the whole table, change df2 to this:

df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (ORDER BY revenue, category) as cumsum
FROM
test_table
""")

Output:

[Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=5500, cumsum=11000),
 Row(category='Cell Phone', revenue=6000, cumsum=17000),
 Row(category='Cell Phone', revenue=6500, cumsum=23500),
 Row(category='Tablet', revenue=6500, cumsum=30000)]

Hope this helps. Using np.cumsum after collecting the data is not very efficient, especially if the dataset is large. Another way you could explore is to use simple RDD transformations such as groupByKey(), then use map to calculate the cumulative sum of each group by some key, and reduce at the end. A rough sketch of this idea is shown below.
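
A possible sketch of that RDD-based approach, using the same (category, revenue) pairs as rdd above; the running_totals helper is illustrative and not part of the original answer:

def running_totals(values):
    # running total within one group, smallest value first
    total, out = 0, []
    for v in sorted(values):
        total += v
        out.append((v, total))
    return out

cumsum_rdd = (rdd.groupByKey()               # group revenues by category
                 .mapValues(running_totals)  # cumulative sum inside each group
                 .flatMapValues(lambda pairs: pairs))

cumsum_rdd.collect()
# e.g. [('Tablet', (5500, 5500)), ('Tablet', (6500, 12000)), ...]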

Here is a simple example:

import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf


sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)

data = sqlcontext.createDataFrame([("Bob",  "M", "Boston",    1, 20),
                                   ("Cam",  "F", "Cambridge", 1, 25),
                                   ("Lin",  "F", "Cambridge", 1, 25),
                                   ("Cat",  "M", "Boston",    1, 20),
                                   ("Sara", "F", "Cambridge", 1, 15),
                                   ("Jeff", "M", "Cambridge", 1, 25),
                                   ("Bean", "M", "Cambridge", 1, 26),
                                   ("Dave", "M", "Cambridge", 1, 21)],
                                  ["name", "gender", "city", "donation", "age"])


data.show()

gives the output:

+----+------+---------+--------+---+
|name|gender|     city|donation|age|
+----+------+---------+--------+---+
| Bob|     M|   Boston|       1| 20|
| Cam|     F|Cambridge|       1| 25|
| Lin|     F|Cambridge|       1| 25|
| Cat|     M|   Boston|       1| 20|
|Sara|     F|Cambridge|       1| 15|
|Jeff|     M|Cambridge|       1| 25|
|Bean|     M|Cambridge|       1| 26|
|Dave|     M|Cambridge|       1| 21|
+----+------+---------+--------+---+

Define a window:

win_spec = (window.Window
                  .partitionBy(['gender', 'city'])
                  .rowsBetween(window.Window.unboundedPreceding, 0))

# window.Window.unboundedPreceding -- first row of the group
# .rowsBetween(..., 0) -- 0 refers to the current row; if -2 were given instead, the frame would start 2 rows before the current row
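
For illustration, a hypothetical rolling variant of win_spec that covers only the two preceding rows plus the current row (the ordering column age is just an example and not part of the original answer):

rolling_spec = (window.Window
                      .partitionBy(['gender', 'city'])
                      .orderBy('age')
                      .rowsBetween(-2, 0))  # frame: two rows before the current row, through the current row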

Now, here is a trap:

temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))

which fails with the error:

TypeErrorTraceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))

/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
    238 
    239     def __iter__(self):
--> 240         raise TypeError("Column is not iterable")
    241 
    242     # string methods

TypeError: Column is not iterable

This is due to using Python's built-in sum function instead of PySpark's. The way to fix this is to use the sum function from pyspark.sql.functions (imported above as sf):

temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
temp.show()

will give:

+----+------+---------+--------+---+--------------+
|name|gender|     city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara|     F|Cambridge|       1| 15|             1|
| Cam|     F|Cambridge|       1| 25|             2|
| Lin|     F|Cambridge|       1| 25|             3|
| Bob|     M|   Boston|       1| 20|             1|
| Cat|     M|   Boston|       1| 20|             2|
|Dave|     M|Cambridge|       1| 21|             1|
|Jeff|     M|Cambridge|       1| 25|             2|
|Bean|     M|Cambridge|       1| 26|             3|
+----+------+---------+--------+---+--------------+

After landing on this thread while trying to solve a similar problem, I solved my issue using this code. Not sure if I'm missing part of the OP's requirement, but this is a way to sum a SQLContext column:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext

# build the configuration first, then pass it to the SparkContext so it actually takes effect
conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

def sum_column(table, column):
    # returns a one-row DataFrame containing sum(column) for the given table
    sc_table = sqlContext.table(table)
    return sc_table.agg({column: "sum"})

sum_column("db.tablename", "column").show()

It is not true that window functions work only with HiveContext. You can use them even with sqlContext:

from pyspark.sql.window import *
import pyspark.sql.functions as F

myPartition = Window.partitionBy(['col1', 'col2', 'col3'])

# use F.sum (not Python's built-in sum), otherwise this hits the "Column is not iterable" error shown above
temp = temp.withColumn("#dummy", F.sum(temp.col4).over(myPartition))
