How to reduce and sum grids within a Scala Spark DF
Is it possible to reduce an n x n grid in a Scala Spark DF to sums over its sub-grids and create a new df? Existing df:
1 1 0 0 0 0 0 0
0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 0 0 0 0 1 1
0 1 0 0 0 0 1 0
0 0 0 0 1 0 0 0
If n = 4, can we take the 4x4 grids from this df and sum them?
1 1 0 0 | 0 0 0 0
0 0 0 0 | 0 0 1 0
0 1 0 0 | 0 0 0 0
0 0 0 0 | 0 0 0 0
------------------
0 0 0 0 | 0 0 0 0
0 1 0 0 | 0 0 1 1
0 1 0 0 | 0 0 1 0
0 0 0 0 | 1 0 0 0
and get this output?
3 1
2 4
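For reference, the expected result can be verified with a small pure-Python sketch (no Spark involved; the grid is hard-coded from the question):

```python
# Pure-Python check of the expected n x n block sums.
grid = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]

def block_sums(grid, n):
    """Sum each n x n block of a 2-D list and return the reduced grid."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            sum(grid[r][c]
                for r in range(br, br + n)
                for c in range(bc, bc + n))
            for bc in range(0, cols, n)
        ]
        for br in range(0, rows, n)
    ]

print(block_sums(grid, 4))  # [[3, 1], [2, 4]]
```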
Row-wise you have to aggregate, and column-wise you have to sum. Sample code for 2x2:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
# Create test data frame
tst = sqlContext.createDataFrame(
    [(1, 1, 2, 11), (1, 3, 4, 12), (1, 5, 6, 13), (1, 7, 8, 14),
     (2, 9, 10, 15), (2, 11, 12, 16), (2, 13, 14, 17), (2, 13, 14, 17)],
    schema=['col1', 'col2', 'col3', 'col4'])
w = Window.orderBy(F.monotonically_increasing_id())
tst1 = tst.withColumn("grp", F.ceil(F.row_number().over(w) / 2))  # 2 is for this example - change to 4
tst_sum_row = tst1.groupby('grp').agg(*[F.sum(coln).alias(coln) for coln in tst1.columns if 'grp' not in coln])
# The sum used here is Python's built-in sum, not the pyspark function F.sum().
# Note the integer division (//) so this also works under Python 3.
expr = [sum([F.col(tst.columns[i]), F.col(tst.columns[i + 1])]).alias('coln' + str(i)) for i in [x * 2 for x in range(len(tst.columns) // 2)]]
tst_sum_coln = tst_sum_row.select(*expr)
tst.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 1| 2| 11|
| 1| 3| 4| 12|
| 1| 5| 6| 13|
| 1| 7| 8| 14|
| 2| 9| 10| 15|
| 2| 11| 12| 16|
| 2| 13| 14| 17|
| 2| 13| 14| 17|
+----+----+----+----+
tst_sum_coln.show()
+-----+-----+
|coln0|coln2|
+-----+-----+
| 6| 29|
| 14| 41|
| 24| 53|
| 30| 62|
+-----+-----+
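The two-step logic above (row-wise group sums, then pairwise column sums) can be mirrored on plain Python lists to sanity-check the numbers; a sketch, no Spark needed:

```python
# Mirror the PySpark answer on plain lists: first sum rows in groups of 2,
# then sum columns in adjacent pairs.
rows = [
    (1, 1, 2, 11), (1, 3, 4, 12), (1, 5, 6, 13), (1, 7, 8, 14),
    (2, 9, 10, 15), (2, 11, 12, 16), (2, 13, 14, 17), (2, 13, 14, 17),
]

# Step 1: row-wise aggregation -- element-wise sum of every 2 consecutive rows.
row_sums = [
    [a + b for a, b in zip(rows[i], rows[i + 1])]
    for i in range(0, len(rows), 2)
]

# Step 2: column-wise sum -- add every pair of adjacent columns.
result = [
    [r[j] + r[j + 1] for j in range(0, len(r), 2)]
    for r in row_sums
]

print(result)  # [[6, 29], [14, 41], [24, 53], [30, 62]]
```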
Check the code below.
scala> df.show(false)
+---+---+---+---+---+---+---+---+
|a |b |c |d |e |f |g |h |
+---+---+---+---+---+---+---+---+
|1 |1 |0 |0 |0 |0 |0 |0 |
|0 |0 |0 |0 |0 |0 |1 |0 |
|0 |1 |0 |0 |0 |0 |0 |0 |
|0 |0 |0 |0 |0 |0 |0 |0 |
|0 |0 |0 |0 |0 |0 |0 |0 |
|0 |1 |0 |0 |0 |0 |1 |1 |
|0 |1 |0 |0 |0 |0 |1 |0 |
|0 |0 |0 |0 |1 |0 |0 |0 |
+---+---+---+---+---+---+---+---+
scala> val n = 4
This divides the rows into n/2 = 2 groups, each holding 4 rows of data.
scala> val rowExpr = ntile(n/2)
.over(
Window
.orderBy(lit(1))
)
Collect all values into an array of arrays.
scala> val aggExpr = df
.columns
.grouped(4)
.toList.map(c => collect_list(array(c.map(col):_*)).as(c.mkString))
Flatten the arrays, remove the 0 values, and take the size of the resulting array.
scala> val selectExpr = df
.columns
.grouped(4)
.toList
.map(c => size(array_remove(flatten(col(c.mkString)),0)).as(c.mkString))
Apply rowExpr & selectExpr.
scala> df
.withColumn("row_id",rowExpr)
.groupBy($"row_id")
.agg(aggExpr.head,aggExpr.tail:_*)
.select(selectExpr:_*)
.show(false)
Final output
+----+----+
|abcd|efgh|
+----+----+
|3 |1 |
|2 |4 |
+----+----+
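One caveat worth noting: `size(array_remove(flatten(...), 0))` counts the non-zero entries, which equals the block sum only because this grid contains just 0s and 1s. The same pipeline can be sketched in plain Python to see each step (a sketch, no Spark needed):

```python
# Pure-Python sketch of the Scala pipeline: group rows into halves (the ntile
# step), take each 4-column slice per group, flatten, drop zeros, count.
# Note: counting non-zero entries equals the sum only for a 0/1 grid.
grid = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
n = 4
result = []
for row_block in range(0, len(grid), n):
    group = grid[row_block:row_block + n]          # one "ntile" of rows
    out_row = []
    for col_block in range(0, len(group[0]), n):
        flat = [v for row in group for v in row[col_block:col_block + n]]
        nonzero = [v for v in flat if v != 0]      # array_remove(..., 0)
        out_row.append(len(nonzero))               # size(...)
    result.append(out_row)
print(result)  # [[3, 1], [2, 4]]
```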