Python Spark local parallelism
I am running Python Spark locally to try an example I found on the Spark website. I generated a random DataFrame so I would have a larger sample for performance testing.
I set up my SparkSession and SparkContext like this:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("KMeansParallel") \
    .getOrCreate()

sc = spark.sparkContext
But the program does not seem to run in parallel as suggested there. In Task Manager I see only 10-25% of the processor being used, which makes me think Python is stuck on a single core (the GIL?).
What am I doing wrong? I have tried changing some parameters on the SparkSession:
.config("spark.executor.instances", 7) \
.config("spark.executor.cores", 3) \
.config("spark.default.parallelism", 7) \
.config("spark.driver.memory", "15g") \
I am running with 16 GB of RAM, 4 cores, and 8 logical processors. I leave some resources for the OS as suggested here (even though local mode may differ from a YARN configuration).
Full code:
from pyspark.sql import SparkSession, Row
from pyspark import SparkContext
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

import numpy as np
import math
import time

def gaussianMixture(sc, spark, nPoints, nGaussian, gaussianVariance):
    """
    Returns a dataframe with <nPoints> points generated randomly
    around <nGaussian> centers by a normal distribution
    N(<a center chosen randomly>, <gaussianVariance>)
    """
    # Generating centers
    meanPointsNumpy = np.random.rand(nGaussian, 2)

    def geneRandomChoice(nGaussian, nPoints):
        for i in range(nPoints):
            yield (i, np.random.choice(nGaussian, 1))

    # Generating points in a numpy ndarray
    dataNumpy = np.array([
        [t[0],
         np.random.normal(loc = meanPointsNumpy[t[1], 0], scale = math.sqrt(gaussianVariance)),
         np.random.normal(loc = meanPointsNumpy[t[1], 1], scale = math.sqrt(gaussianVariance))]
        for t in geneRandomChoice(nGaussian, nPoints)
    ])

    # Converting ndarray to RDD then to dataFrame
    dataRDD = sc \
        .parallelize(dataNumpy) \
        .map(lambda x: Row(label = int(x[0]), features = Vectors.dense(x[1].astype(float), x[2].astype(float))))
    data = spark.createDataFrame(dataRDD)

    return data

def kMeansParallel(sc, spark, nPoints, nGaussian, gaussianVariance):
    """
    Evaluates the clusters from the dataFrame created
    by the gaussianMixture function
    """
    dataset = gaussianMixture(sc, spark, nPoints, nGaussian, gaussianVariance)

    t1 = time.time()

    # Trains a k-means model.
    kmeans = KMeans().setK(nGaussian)  # .setSeed(1)
    model = kmeans.fit(dataset)

    # Make predictions
    predictions = model.transform(dataset)

    # Evaluate clustering by computing Silhouette score
    evaluator = ClusteringEvaluator()
    silhouette = evaluator.evaluate(predictions)
    # print("Silhouette with squared euclidean distance = " + str(silhouette))

    return time.time() - t1

nPoints = 10000
nGaussian = 100
gaussianVariance = 0.1
nTests = 20

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("KMeansParallel") \
    .getOrCreate()
sc = spark.sparkContext

meanTime = 0
for i in range(nTests):
    res = kMeansParallel(sc, spark, nPoints, nGaussian, gaussianVariance)
    meanTime += res
meanTime /= nTests

print("Mean Time : " + str(meanTime))

spark.stop()
The GIL is not the problem here, because Spark launches as many Python worker processes as it needs: one per executor when running distributed, and one per core when running locally (since they all run inside the driver).
Most likely the data size / number of partitions is too low.