[英]How to apply groupby and transpose in Pyspark?
我有一個 dataframe 如下圖所示
df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
'readings' : ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
'val' :[5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})
我上面的輸入 dataframe 看起來像這樣
盡管下面的代碼在 Python pandas 中運行良好(感謝 Jezrael),但當我將其應用於真實數據(超過 4M 條記錄)時,它運行了很長時間。 所以我試圖使用pyspark
。 請注意,我已經嘗試過Dask
、 modin
、 pandarallel
,它們等效於 pandas 進行大規模處理,但也沒有幫助。 以下代碼的作用是it generates the summary statistics for each subject for each reading
。 您可以在下面查看預期的 output 以獲得一個想法
df_op = (df.groupby(['subject_id','readings'])['val']
.describe()
.unstack()
.swaplevel(0,1,axis=1)
.reindex(df['readings'].unique(), axis=1, level=0))
df_op.columns = df_op.columns.map('_'.join)
df_op = df_op.reset_index()
你能幫我在pyspark中實現上述操作嗎? 當我嘗試以下時,它拋出了一個錯誤
df.groupby(['subject_id','readings'])['val']
例如 - subject_id = 1 有 4 個讀數,但有 3 個唯一讀數。 因此,對於 subject_id = 1,我們得到 3 * 8 = 24 列。為什么是 8? 因為它是MIN,MAX,COUNT,Std,MEAN,25%percentile,50th percentile,75th percentile
。 希望這可以幫助
當我在 pyspark 開始使用此功能時,它返回以下錯誤
TypeError: 'GroupedData' object 不可下標
我希望我的 output 如下所示
您需要先分組並獲取每個讀數的統計信息,然后制作 pivot 以獲得預期結果
import pyspark.sql.functions as F
agg_df = df.groupby("subject_id", "readings").agg(F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
F.count(F.col("val")),
F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))
這將為您提供以下 output:
+----------+--------+--------+--------+--------+----------+-----------+-----------+
|subject_id|readings|avg(val)|min(val)|max(val)|count(val)|quantile_25|quantile_75|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
| 2| READ_1| 5.0| 5| 5| 1| 5| 5|
| 2| READ_5| 7.0| 7| 7| 1| 7| 7|
| 2| READ_8| 12.0| 12| 12| 1| 12| 12|
| 4| READ_08| 43.0| 43| 43| 1| 43| 43|
| 1| READ_2| 6.0| 6| 6| 1| 6| 6|
| 1| READ_1| 6.0| 5| 7| 2| 5| 7|
| 2| READ_6| 16.0| 16| 16| 1| 16| 16|
| 1| READ_3| 11.0| 11| 11| 1| 11| 11|
| 4| READ_11| 32.0| 32| 32| 1| 32| 32|
| 3| READ_10| 13.0| 13| 13| 1| 13| 13|
| 3| READ_12| 56.0| 56| 56| 1| 56| 56|
| 4| READ_14| 13.0| 13| 13| 1| 13| 13|
| 4| READ_07| 46.0| 46| 46| 1| 46| 46|
| 4| READ_09| 45.0| 45| 45| 1| 45| 45|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
如果您的 pivot readings
使用 groupby subject_id
,您將獲得預期的 output:
agg_df2 = df.groupby("subject_id").pivot("readings").agg(F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
F.count(F.col("val")),
F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))
for i in agg_df2.columns:
agg_df2 = agg_df2.withColumnRenamed(i, i.replace("(val)", ""))
agg_df2.show()
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
|subject_id|READ_07_avg(val)|READ_07_min(val)|READ_07_max(val)|READ_07_count(val)|READ_07_quantile_25|READ_07_quantile_75|READ_08_avg(val)|READ_08_min(val)|READ_08_max(val)|READ_08_count(val)|READ_08_quantile_25|READ_08_quantile_75|READ_09_avg(val)|READ_09_min(val)|READ_09_max(val)|READ_09_count(val)|READ_09_quantile_25|READ_09_quantile_75|READ_1_avg(val)|READ_1_min(val)|READ_1_max(val)|READ_1_count(val)|READ_1_quantile_25|READ_1_quantile_75|READ_10_avg(val)|READ_10_min(val)|READ_10_max(val)|READ_10_count(val)|READ_10_quantile_25|READ_10_quantile_75|READ_11_avg(val)|READ_11_min(val)|READ_11_max(val)|READ_11_count(val)|READ_11_quantile_25|READ_11_quantile_75|READ_12_avg(val)|READ_12_min(val)|READ_12_max(val)|READ_12_count(val)|READ_12_quantile_25|READ_12_quantile_75|READ_14_avg(val)|READ_14_min(val)|READ_14_max(val)|READ_14_count(val)|READ_14_quantile_25|READ_14_quantile_75|READ_2_avg(val)|READ_2_min(val)|READ_2_max(val)|READ_2_count(val)|READ_2_quantile_25|READ_2_quantile_75|READ_3_avg(val)|READ_3_min(val)|READ_3_max(val)|READ_3_count(val)|READ_3_quantile_25|READ_3_quantile_75|READ_5_avg(val)|READ_5_min(val)|READ_5_max(val)|READ_5_count(val)|READ_5_quantile_25|READ_5_quantile_75|READ_6_avg(val)|READ_6_min(val)|READ_6_max(val)|READ_6_count(val)|READ_6_quantile_25|READ_6_quantile_75|READ_8_avg(val)|READ_8_min(val)|READ_8_max(val)|READ_8_count(val)|READ_8_quantile_25|READ_8_quantile_75|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
| 1| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 5| 7| 2| 5| 7| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 6| 6| 1| 6| 6| 11.0| 11| 11| 1| 11| 11| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 3| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| 56.0| 56| 56| 1| 56| 56| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 2| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 5.0| 5| 5| 1| 5| 5| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 7.0| 7| 7| 1| 7| 7| 16.0| 16| 16| 1| 16| 16| 12.0| 12| 12| 1| 12| 12|
| 4| 46.0| 46| 46| 1| 46| 46| 43.0| 43| 43| 1| 43| 43| 45.0| 45| 45| 1| 45| 45| null| null| null| null| null| null| null| null| null| null| null| null| 32.0| 32| 32| 1| 32| 32| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.