Python Spark: how to parallelize Spark DataFrame computation using Spark in Databricks

Question

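I have Python code that computes over a dataframe in parallel using the following library: multiprocessing.pool

import numpy as np
import pandas as pd
from multiprocessing.pool import ThreadPool as Pool

Here is how I create the dataframe: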

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': [3,4,2,5],
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

Here is how I parallelize the computation on the dataframe:

def parallelize_dataframe(df, func, n_cores=1):
  df_split = np.array_split(df, n_cores)  # split the dataframe into n_cores chunks
  pool = Pool(n_cores)  # create a pool of n_cores worker threads
  df = pd.concat(pool.map(func, df_split), ignore_index=True)  # apply func to each chunk and concatenate the results
  pool.close()  # stop accepting new tasks
  pool.join()  # wait for the worker threads to finish
  return df  # return the combined dataframe

Here is the function to apply to a dataframe column:

def TrouverLesTest(x):
    if(x=='test'):
        return True
    elif(x=='train'):
        return False

Here is the function that runs several apply calls on the dataframe:

def Do_Compute(df):
  df['E_Det']=df['E'].apply(TrouverLesTest)
  df['E_Det_vs']=df['E'].apply(TrouverLesTest)
  return df

parallelize_dataframe(df2, Do_Compute)

Output:

     A          B    C  D      E    F  E_Det  E_Det_vs
0  1.0 2013-01-02  1.0  3   test  foo   True      True
1  1.0 2013-01-02  1.0  4  train  foo  False     False
2  1.0 2013-01-02  1.0  2   test  foo   True      True
3  1.0 2013-01-02  1.0  5  train  foo  False     False

My question: how can I compute the dataframe df2 faster in Spark, using the function Do_Compute(df)?

Answer 1

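If you only need a simple computation like the one in your question, you can do the following (df here is the Spark DataFrame holding your data):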

import pyspark.sql.functions as F

df2 = df.withColumn(
    'E_Det',
    F.when(F.col('E') == 'test', F.lit(True)).when(F.col('E') == 'train', F.lit(False))
)

Note that if you are using Spark, there is no need to mess with multithreading libraries. Spark is inherently a parallel computing framework.

To use Python/pandas functions in Spark more generally, you can use mapInPandas:

def Do_Compute(iterator):
    for df in iterator:
        df['E_Det']=df['E'].apply(TrouverLesTest)
        df['E_Det_vs']=df['E'].apply(TrouverLesTest)
        yield df

df2 = df.mapInPandas(Do_Compute, schema)

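where you need to provide the schema of the resulting dataframe. For example:

from pyspark.sql.types import BooleanType, StructField, StructType

# Build a new StructType; StructType.add would modify df.schema in place
new_fields = [StructField('E_Det', BooleanType()), StructField('E_Det_vs', BooleanType())]
schema = StructType(df.schema.fields + new_fields)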

See the documentation for more details on its usage.
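Since the function handed to mapInPandas receives an iterator of pandas DataFrames and yields pandas DataFrames, its logic can be checked locally with plain pandas before running it on a cluster. The following is a minimal sketch under that assumption; the reduced dataframe pdf and the two hand-made chunks are hypothetical stand-ins for the batches Spark would pass in, and the copy() call is added here only to avoid writing into a slice.

```python
import pandas as pd

def TrouverLesTest(x):
    if x == 'test':
        return True
    elif x == 'train':
        return False

def Do_Compute(iterator):
    for df in iterator:
        df = df.copy()  # work on a copy so the input chunk is not modified
        df['E_Det'] = df['E'].apply(TrouverLesTest)
        df['E_Det_vs'] = df['E'].apply(TrouverLesTest)
        yield df

# Hypothetical, reduced version of the question's data (columns D and E only).
pdf = pd.DataFrame({'D': [3, 4, 2, 5],
                    'E': ['test', 'train', 'test', 'train']})

# Two chunks stand in for the batches Spark would feed to the function.
chunks = [pdf.iloc[:2], pdf.iloc[2:]]
out = pd.concat(Do_Compute(iter(chunks)), ignore_index=True)
print(out)
```

If the columns and values of out look right here, the same Do_Compute can be passed to mapInPandas together with a schema that lists the two new boolean columns.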
