
Python Library not recognized on Spark Cluster

I have a Spark DataFrame containing text data. I am trying to clean HTML tags out of the data using the Python BeautifulSoup library.

When I use BeautifulSoup on Spark installed locally on my Mac laptop, it works fine with a Spark UDF and strips the tags:

from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def html_parsing(x):
    """Cleans the text in a DataFrame text column,
    keeping only the text found inside <p> tags."""
    textcleaned = ''
    souptext = BeautifulSoup(x)
    p_tags = souptext.find_all('p')
    for p in p_tags:
        if p.string:
            textcleaned += p.string
    return textcleaned


parse_html = udf(html_parsing, StringType())

sdf_cleaned = sdf_rss.dropna(subset=['desc']).withColumn('text_cleaned', parse_html('desc'))\
    .select('id', 'title', 'text_cleaned')

sdf_cleaned.cache().take(3)

[Row(id=u'-33753621', title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', text_cleaned=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),

However, when I run the same code on Spark installed on the cluster, it says "No module named bs4". The code above is being run in an Anaconda Jupyter notebook with a PySpark kernel installed on the cluster.

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9, 107-45-c02.sc): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named bs4

I want to emphasize that BeautifulSoup is also installed in the Anaconda on the Spark cluster; I confirmed this by running

conda list 

and the package shows up there.
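For reference, one way to check which Python interpreter the executors actually use (a diagnostic sketch, assuming sc is the notebook's SparkContext):

# Diagnostic sketch: ask a few executor partitions which Python they run
# and whether bs4 is importable there.
def probe(_):
    import sys
    try:
        import bs4
        loc = bs4.__file__
    except ImportError:
        loc = 'bs4 NOT importable'
    yield (sys.executable, loc)

print(sc.parallelize(range(4), 4).mapPartitions(probe).collect())

If the reported interpreter is not the Anaconda one, the workers are resolving a different Python than the one conda list inspected.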

So what am I missing here?

Your help is much appreciated.

I ran into the same problem and learned that even if you list a dependency in your conda/pip requirements file, that still does not mean every Spark worker node will have it. So you also need to distribute all the dependencies across the Spark nodes, or at least make sure they are shipped to the cluster at startup.
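For instance, a minimal sketch of shipping the package yourself (the paths and the spark-submit recipe below are illustrative; note that newer bs4 releases also depend on soupsieve, which would need the same treatment):

import os
import shutil

import bs4  # importable on the driver, per conda list

# Zip the bs4/ directory out of the driver's site-packages...
pkg_dir = os.path.dirname(bs4.__file__)
archive = shutil.make_archive('/tmp/bs4', 'zip',
                              root_dir=os.path.dirname(pkg_dir),
                              base_dir='bs4')

# ...and ship it to the executors; the zip lands on their sys.path,
# so `import bs4` starts working inside UDFs.
sc.addPyFile(archive)

# On YARN it is usually cleaner to ship a whole packed conda env instead:
#   conda pack -n myenv -o myenv.tar.gz
#   spark-submit --archives myenv.tar.gz#env \
#       --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python job.py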

A good example is given in this answer: https://stackoverflow.com/a/49971939/2957102

But for your case of stripping the data, you could probably find a way to do it without beautifulsoup, which could also serve as a workaround. For example, load the data first and then do the cleanup with pandas or something else.
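A sketch of that idea using only the Python standard library, so nothing extra has to reach the workers; it approximates the find_all('p') logic from the question (Python 3 shown; on Python 2 the import would be from HTMLParser import HTMLParser):

from html.parser import HTMLParser

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

class PTagText(HTMLParser):
    """Collects the text that appears inside <p>...</p> tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = False
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)

def html_parsing_stdlib(x):
    parser = PTagText()
    parser.feed(x)
    return ''.join(parser.parts)

parse_html = udf(html_parsing_stdlib, StringType())  # drop-in for the UDF above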

In my case, there was a UI button for enabling a specific library on a Databricks Spark cluster on Azure, and it solved the ModuleNotFoundError: No module named 'bs4' problem. See the screenshot:

[screenshot: the Databricks cluster library installation UI]

P.S. Keep in mind that your cluster has to be online while the dependencies are being installed on the worker nodes.
