
pyspark using sklearn.DBSCAN getting an error after submitting the Spark job locally

I am using sklearn.DBSCAN in a pyspark job; see the code snippet below. I have also zipped all the dependency modules into the deps.zip file that is added to the SparkContext.

from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark import SQLContext
from pyspark.sql.types import DoubleType
from pyspark.sql import Row

def dbscan_latlng(lat_lngs, mim_distance_km, min_points=10):
    # Cluster the (lat, lng) pairs with DBSCAN using the haversine metric;
    # eps is converted from kilometres to radians.
    coords = np.asmatrix(lat_lngs)
    kms_per_radian = 6371.0088
    epsilon = mim_distance_km / kms_per_radian
    db = DBSCAN(eps=epsilon, min_samples=min_points, algorithm='ball_tree',
                metric='haversine').fit(np.radians(coords))
    cluster_labels = db.labels_
    num_clusters = len(set(cluster_labels))
    clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
    maxClusters = clusters.map(len).max()
    if maxClusters > 3:
        # Return the coordinates of the largest cluster as a list.
        dfClusters = clusters.to_frame('coords')
        dfClusters['length'] = dfClusters.apply(lambda x: len(x['coords']), axis=1)
        custCluster = dfClusters[dfClusters['length'] == maxClusters].reset_index()
        return custCluster['coords'][0].tolist()

sc = SparkContext()
sc.addPyFile('/content/airflow/dags/deps.zip')
sqlContext = SQLContext(sc)
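
For reference, the clustering function can be exercised locally, independent of Spark. Here is a minimal sketch; the sample coordinates are made up for illustration (four nearby points plus one distant point):

sample_lat_lngs = [
    (40.7128, -74.0060), (40.7130, -74.0059),
    (40.7127, -74.0062), (40.7126, -74.0061),  # four points within a few hundred metres
    (34.0522, -118.2437),                      # one far-away point (treated as noise)
]
# Cluster points lying within ~0.5 km of each other, requiring at least
# 3 points per cluster, and print the coordinates of the largest cluster.
print(dbscan_latlng(sample_lat_lngs, 0.5, min_points=3))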

However, after submitting the job with spark-submit --master local[4] FindOutliers.py, I get the Python error below saying sklearn/__check_build is not a directory. Can anyone help me with this? Thanks a lot!

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__init__.py"
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__check_build/__init__.py", line 46
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/tmp/spark-beb8777f-b7d5-40be-a72b-c16e10264a50/userFiles-3762d9c0-6674-467a-949b-33968420bae1/deps.zip/sklearn/__check_build'
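
One reading of this traceback (an interpretation, not something stated in the original post) is that the worker is trying to import scikit-learn's compiled __check_build extension from inside deps.zip, and Python cannot load C extension modules from a zip archive added with addPyFile. A common workaround is to leave scikit-learn (and other compiled packages) out of deps.zip and instead point the executors at a Python interpreter that already has them installed. A minimal sketch follows; the interpreter path is guessed from the virtualenv visible in the traceback, and the spark.pyspark.python setting requires Spark 2.1 or later:

from pyspark import SparkConf, SparkContext

# Sketch of one possible workaround (assumption, not taken from the original post):
# use a worker interpreter whose site-packages already contains scikit-learn,
# and keep deps.zip for pure-Python modules only.
conf = (SparkConf()
        .setAppName("FindOutliers")
        .set("spark.pyspark.python", "/root/.virtualenvs/jacob/bin/python"))  # hypothetical path

sc = SparkContext(conf=conf)
sc.addPyFile('/content/airflow/dags/deps.zip')  # still fine for pure-Python code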

Try:

import pyspark as ps

sc = ps.SparkContext()
sc.addPyFile('/content/airflow/dags/deps.zip')
sqlContext = ps.SQLContext(sc)

