Tag [pandas-udf] - Stack Overflow

Azure Databricks: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's;

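Environment: Azure Databricks cluster, 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12). I have a pandas_udf that works for 4 rows, but when I tried more than 4 rows I got an error. PythonException: 'RuntimeError: The length of output in Scalar iterator panda ...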

pandas udf into column in array type

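My task is to store the following in an array-type column: When I run the cmd, I get this error: TypeError: anomaly_detections() takes 1 positional argument but 2 were given. Any help would be appreciated. I expect the column "deviceIssues" to be an ArrayType column. ...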

Error in pandas_udf with the vector expected 1, got 2

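I am trying to get the country name from latitude and longitude as input, so I used the Nominatim API. It works when I pass it as a UDF, but when I try pandas_udf I get the following error: An exception was thrown from a UDF: "RuntimeError: Result vector from pandas_udf was not the required length: expected 1, ...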

Using pandas udf without looping in pyspark

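Suppose I have a big Spark DataFrame and I don't know how many columns it has. (The solution has to use a pandas udf in pyspark, not a different approach.) I want to perform an operation on all columns. Looping over the columns is fine, but I don't want to loop through the rows; I want the operation to act on a whole column at once. I couldn't find how to do this on the internet. Suppose I have this ...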

Pandas UDF Structfield return

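I am trying to return a StructField from a Pandas UDF in Pyspark that is used with aggregation and has the following function signature: But it turns out the return type is not supported. Is there another way to achieve the same thing? I can make three Pandas udfs that return primitive types, and that works, but func ...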

Pandas UDF with dictionary lookup and conditionals

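I want to use pandas_udf in Pyspark for some column transformations and calculations, and it seems a pandas udf can't be written exactly like a normal UDF. An example function looks like this: Basically, it takes two column values from a Spark DataFrame and returns a value that I intend to use with withColumn: But this doesn't work. How should I ...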

Geopandas convert crs

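I have created a geopandas DataFrame with 50 million records containing latitude and longitude in CRS 3857, and I want to convert it to 4326. Because the dataset is so large, geopandas is unable to convert it. How can I do this in a distributed manner? ...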
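For readers unfamiliar with the two CRSs, the per-point math behind a 3857-to-4326 conversion is small enough to write out by hand. The following is a minimal sketch using only the standard library, not the asker's code and not a geopandas API; the names `web_mercator_to_wgs84` and `convert_batch` are hypothetical. It assumes the standard spherical inverse Mercator formula with the EPSG:3857 sphere radius of 6378137 m. Because each point is converted independently, the same formula could in principle be applied batch by batch, which is the shape of work a pandas UDF distributes.

```python
import math

# Radius of the sphere used by EPSG:3857 (Web Mercator), in metres.
EARTH_RADIUS_M = 6378137.0


def web_mercator_to_wgs84(x, y):
    """Convert one EPSG:3857 point (metres) to EPSG:4326 (lon, lat in degrees).

    Hypothetical helper for illustration; uses the spherical inverse
    Mercator formula.
    """
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


def convert_batch(xs, ys):
    """Convert parallel sequences of x and y values, one batch at a time.

    Each point is independent of the others, so batches can be processed
    in any order and on any worker.
    """
    pairs = [web_mercator_to_wgs84(x, y) for x, y in zip(xs, ys)]
    lons = [lon for lon, _ in pairs]
    lats = [lat for _, lat in pairs]
    return lons, lats
```

For example, `web_mercator_to_wgs84(0.0, 0.0)` maps the projection origin back to longitude 0 and latitude 0 (up to floating-point rounding).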

Apply wordninja.split() using pandas_udf

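I have a DataFrame df with a column sld of type string, which contains runs of consecutive characters with no spaces or delimiters. One library that can split them is wordninja: for example, wordninja.split('culturetosuccess') outputs ['culture','to','success']. Using pa ...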

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

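I have a piece of code that I want to translate into a Pandas UDF in PySpark, but I'm having some trouble understanding whether conditional statements can be used. def is_pass_in(df): x = list(df["string"]) result = [] for i in x: if "p ...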

PySpark: Pandas UDF for scipy statistical transformations

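I am trying to create a column of standardized values (z-scores) of column x on a Spark DataFrame, but I am missing something, because none of my attempts works correctly. Here is my example: This results in obviously wrong calculations: Thank you for your help. ...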
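The asker's code is elided here, so the cause of the wrong numbers cannot be known from this excerpt. One pitfall that can produce this symptom is that a Series-to-Series pandas UDF receives the column in batches, so a mean or standard deviation computed inside the function describes one batch rather than the whole column. The sketch below illustrates that effect with the standard library only; `zscores` is a hypothetical helper, and the two list slices merely stand in for Spark batches.

```python
import statistics


def zscores(values):
    """Standardize a list using its own mean and sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Statistics taken over the whole column: the intended result.
whole = zscores(column)

# Statistics taken separately per batch, which is what happens when the
# aggregation runs inside a function that only ever sees one batch.
batched = zscores(column[:3]) + zscores(column[3:])
```

Here `batched` repeats the same three standardized values for each half, while `whole` does not, so the two results disagree. If this is the cause, computing the mean and standard deviation once over the full column and passing them into the per-row arithmetic avoids the discrepancy.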

Databricks notebook runs faster when triggered manually compared to when run as a job

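I don't know whether this question has been covered before, but here it goes: I have a notebook that I can run manually using the "Run" button in the notebook, or as a job. When I run the notebook directly, the runtime is roughly 2 hours. But when I execute it as a job, the runtime is very long (around 8 hours). The piece of code that takes the longest calls applyInPandas ...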

Dividing a set of columns by its average in Pyspark

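I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I can't find the correct way to do it. Below are the sample data and my code so far. Input data: Expected output: Function as of now (not working): ...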

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using @udf or @pandas_udf

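I am trying to compute the .dot product between 2 columns of a given DataFrame. SparseVectors already has this ability in Spark, so I tried to do it in a simple and scalable way, without converting to RDDs or DenseVectors, but I am stuck. I have spent the past 3 days trying to find an approach, but it fails and does not return from the dataframe ...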

Parallelize MLflow Project runs with Pandas UDF on Azure Databricks Spark

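I'm trying to parallelize the training of multiple time series using Spark on Azure Databricks. Besides training, I would like to log metrics and models using MLflow. The structure of the code is quite simple (basically adapted from this example). A Databricks notebook triggers the MLflow Project. The main function is called. It basically executes three steps: Read ...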

PySpark UDF to Pandas UDF for string columns

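I have a UDF that is slow on large datasets, and I am trying to improve execution time and scalability by using pandas_udfs. All my searches and the official documentation focus on the scalar and mapping approaches, which I have already used, but I have failed to extend them to the series or pandas DataFrame approach. Could you point me in the right direction? I want to execute in parallel, and the current UDF approach is very ...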