
How should I structure this execution flow in Spark?

I've been playing around with Spark, but I can't get my head around how to structure this execution flow. The pseudocode is below:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('myApp')  # placeholder app name
sc = SparkContext(conf=conf)
sqlSC = SQLContext(sc)

df1 = getBigDataSetFromDb()
ddf1 = sqlSC.createDataFrame(sc.broadcast(df1))

df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.broadcast(df2))

datesList = sc.parallelize(aListOfDates)

def myComplicatedFunc(cobDate):
    # restrict both data sets to a single business date
    filteredDF1 = ddf1.filter(ddf1['BusinessDate'] == cobDate)
    filteredDF2 = ddf2.filter(ddf2['BusinessDate'] == cobDate)
    # some more complicated stuff that uses filteredDF1 & filteredDF2
    return someValue

results = datesList.map(myComplicatedFunc)

However, what I get instead is this:

Traceback (most recent call last):
  File "/net/nas/SysGrid_Users/John.Richardson/Code/HistoricVars/sparkTest2.py", line 76, in <module>
    varResults = distDates.map(varFunc).collect()
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2379, in _jrdd
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2299, in _prepare_for_python_RDD
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 428, in dumps
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 646, in dumps
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 107, in dump
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 740, in save_tuple
    save(element)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 199, in save_function
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 236, in save_function_tuple
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 725, in save_tuple
    save(element)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 770, in save_list
    self._batch_appends(obj)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 797, in _batch_appends
    save(tmp[0])
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 193, in save_function
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 241, in save_function_tuple
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 841, in _batch_setitems
    save(v)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 542, in save_reduce
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 836, in _batch_setitems
    save(v)
  File "/net/nas/uxhome/condor_ldrt-s/Python/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/net/nas/uxhome/condor_ldrt-s/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o44.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I suspect I'm going about this the wrong way. I thought the point of using broadcast variables was that I could use them inside a closure. But perhaps I have to do some kind of join instead?
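For reference, a minimal sketch of that join alternative, assuming ddf1 and ddf2 are built as ordinary DataFrames (no broadcast) and aListOfDates is a plain Python list; the column name is taken from the pseudocode above:

# build a one-column DataFrame of the dates of interest
datesDF = sqlSC.createDataFrame([(d,) for d in aListOfDates], ['BusinessDate'])

# join instead of filtering inside a closure: the work stays distributed
# and no DataFrame has to be pickled into myComplicatedFunc
joined1 = ddf1.join(datesDF, 'BusinessDate')
joined2 = ddf2.join(datesDF, 'BusinessDate')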

While I agree with the comment about the missing domain context, I believe this is not what you want:

df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.broadcast(df2))

You didn't say what type df2 is, but let's assume it's an array rather than something that is already a DataFrame (despite the df* naming). If it's an array, what you probably want is:

df2 = getOtherBigDataSetFromDb()
ddf2 = sqlSC.createDataFrame(sc.parallelize(df2))
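As a hypothetical illustration (the question doesn't show df2's shape, so the column names here are assumptions), if getOtherBigDataSetFromDb() returns a list of tuples, passing a schema makes the result usable in the BusinessDate filters:

df2 = getOtherBigDataSetFromDb()  # assumed shape: [('2016-01-04', 42.0), ...]
ddf2 = sqlSC.createDataFrame(sc.parallelize(df2), ['BusinessDate', 'Value'])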

That said, getOtherBigDataSetFromDb implies it really is a big data set. So while this flow will work, if your data set is truly large you may want to consume it in chunks. You could write that yourself, or there may already be a library that can read from your database of choice. Either way, I believe you meant parallelize rather than broadcast; a chunked-read sketch follows below.
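One way to avoid hand-rolling the chunking, assuming the source is reachable over JDBC (the question doesn't say how getOtherBigDataSetFromDb works), is the partitioned JDBC reader that ships with Spark 1.6; the URL, table, and bounds below are placeholders:

# Spark opens numPartitions connections, each reading one slice of the id range
ddf2 = sqlSC.read.jdbc(
    url='jdbc:postgresql://dbhost/mydb',  # placeholder connection string
    table='other_big_table',              # placeholder table name
    column='id', lowerBound=1, upperBound=1000000, numPartitions=16,
    properties={'user': 'spark', 'password': '...'})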
