
Pyspark pivot on multiple column names

I currently have a dataframe df:

id | c1   | c2   | c3
1  | diff | same | diff
2  | same | same | same
3  | diff | same | same
4  | same | same | same

I want my output to look like:

name| diff | same
c1  |   2  | 2
c2  |   0  | 4
c3  |   1  | 3

When I try:

df.groupby('c2').pivot('c2').count() -> transformation A

I get:

c2   | diff | same
same | null | 2
diff | 2    | null

I assume I need to write a loop over each column and pass each one through transformation A, but I'm having trouble adapting transformation A correctly. Please help.
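
For reference, the per-column loop described in the question can be written without pivot as a union of one-row, per-column aggregates. This is only a minimal sketch using the example df above (note it runs one aggregation job per column; the answers below do it in a single pass):

from functools import reduce
from pyspark.sql import functions as F

# Count 'diff'/'same' occurrences for each non-id column separately,
# then union the one-row results into the desired (name, diff, same) shape.
counts = [
    df.agg(F.count(F.when(F.col(c) == 'diff', 1)).alias("diff"),
           F.count(F.when(F.col(c) == 'same', 1)).alias("same"))
      .select(F.lit(c).alias("name"), "diff", "same")
    for c in df.columns if c != 'id'
]
reduce(lambda a, b: a.union(b), counts).orderBy("name").show()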

Pivot is an expensive shuffle operation and should be avoided whenever possible. Try this logic instead: use arrays_zip and explode to dynamically collapse the columns into rows, then groupby-aggregate:

from pyspark.sql import functions as F

# Collect each non-id column as [value, column_name], zip them into structs,
# explode to one row per (column, value) pair, then count by value.
df.withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(F.col(x), F.lit(x))
                                                      for x in df.columns if x != 'id']))))\
  .withColumn("name", F.col("cols.0")[1]).withColumn("val", F.col("cols.0")[0]).drop("cols")\
  .groupBy("name").agg(F.count(F.when(F.col("val") == 'diff', 1)).alias("diff"),
                       F.count(F.when(F.col("val") == 'same', 1)).alias("same")).orderBy("name").show()

#+----+----+----+
#|name|diff|same|
#+----+----+----+
#|  c1|   2|   2|
#|  c2|   0|   4|
#|  c3|   1|   3|
#+----+----+----+
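
A note on the cryptic F.col("cols.0") access: arrays_zip names each struct field after its input expression, falling back to the ordinal position ("0", "1", ...) for computed inputs like the F.array(...) used here, so the zipped struct's only field is "0". Indexing [1] and [0] into that inner [value, name] array then recovers the column name and its value.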

You can also do this dynamically by creating and exploding a map_type column:

from pyspark.sql import functions as F
from itertools import chain

df.withColumn("cols", F.create_map(*(chain(*[(F.lit(name), F.col(name))\
                                  for name in df.columns if name!='id']))))\
  .select(F.explode("cols").alias("name","val"))\
  .groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
                       F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()

#+----+----+----+
#|name|diff|same|
#+----+----+----+
#|  c1|   2|   2|
#|  c2|   0|   4|
#|  c3|   1|   3|
#+----+----+----+
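
For completeness, the same single-pass unpivot can also be written with Spark SQL's stack() function. A minimal sketch under the same assumptions (the example df above):

from pyspark.sql import functions as F

# Build "stack(3, 'c1', c1, 'c2', c2, 'c3', c3) as (name, val)" from the
# non-id columns, unpivot in one pass, then count per column name.
cols = [c for c in df.columns if c != 'id']
stack_expr = "stack({}, {}) as (name, val)".format(
    len(cols), ", ".join("'{0}', {0}".format(c) for c in cols))
df.selectExpr(stack_expr)\
  .groupBy("name").agg(F.count(F.when(F.col("val") == 'diff', 1)).alias("diff"),
                       F.count(F.when(F.col("val") == 'same', 1)).alias("same"))\
  .orderBy("name").show()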

Another approach: unpivot the columns with a SQL UNION ALL, then pivot on the value column:

from pyspark.sql.functions import *

df = spark.createDataFrame([(1,'diff','same','diff'),(2,'same','same','same'),(3,'diff','same','same'),(4,'same','same','same')],['idcol','C1','C2','C3'])
df.createOrReplaceTempView("MyTable")
#spark.sql("select * from MyTable").collect()

# Unpivot: one row per (idcol, column name, value)
x1 = spark.sql("select idcol, 'C1' AS col, C1 from MyTable union all select idcol, 'C2' AS col, C2 from MyTable union all select idcol, 'C3' AS col, C3 from MyTable")
#display(x1)

# Pivot on the value column and count occurrences per original column
x2 = x1.groupBy('col').pivot('C1').agg(count('C1')).orderBy('col')
display(x2)  # display() is Databricks-specific; use x2.show() elsewhere
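
One caveat with this pivot-based version: pivot() followed by count() leaves null rather than 0 for combinations that never occur, such as diff for C2 in the sample data, so its output differs slightly from the first answer's. A fillna normalizes it:

x2.fillna(0).show()  # replace the nulls left by pivot+count with 0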
