pyspark: error while creating new column in pyspark
I have a PySpark DataFrame:
a = [
(0.31, .3, .4, .6, 0.4),
(.01, .2, .92, .4, .47),
(.3, .1, .05, .2, .82),
(.4, .4, .3, .6, .15),
]
b = ["column1", "column2", "column3", "column4", "column5"]
df = spark.createDataFrame(a, b)
Now I want to create a new column using the following expression:
df.withColumn('new_column' ,(norm.ppf(F.col('column1')) - norm.ppf(F.col('column1') * F.col('column1'))) / (1 - F.col('column2')) ** 0.5)
But it raises an error. Please help!
Update: I have replaced the column names with the corrected ones.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-8dfe7d50be84> in <module>
----> 1 df.withColumn('new_column' ,(norm.ppf(F.col('PD')) - norm.ppf(F.col('PD') * F.col('PD'))) / (1 - F.col('rho_start')) ** 0.5)
~/anaconda3/envs/python3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py in ppf(self, q, *args, **kwds)
1995 args = tuple(map(asarray, args))
1996 cond0 = self._argcheck(*args) & (scale > 0) & (loc == loc)
-> 1997 cond1 = (0 < q) & (q < 1)
1998 cond2 = cond0 & (q == 0)
1999 cond3 = cond0 & (q == 1)
~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/sql/column.py in __nonzero__(self)
633
634 def __nonzero__(self):
--> 635 raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
636 "'~' for 'not' when building DataFrame boolean expressions.")
637 __bool__ = __nonzero__
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
It is not clear what your columns PD and rho_start might contain, but I can give an example of how to call scipy functions from pyspark.
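As for the error itself, the traceback suggests that scipy tries to evaluate `(0 < q) & (q < 1)` right away, while a Spark column is only a lazy expression with no values yet, so it refuses to be turned into True or False. The following is a minimal sketch of that mechanism without Spark. `LazyExpr` and `eager_range_check` are hypothetical names made up for this illustration; `LazyExpr` only imitates the relevant behaviour of a column and is not the pyspark class.

```python
import numpy as np


class LazyExpr:
    """Hypothetical stand-in for a Spark column: comparisons build a new
    expression instead of returning True/False, and bool() is refused."""

    def __lt__(self, other):
        return LazyExpr()

    def __gt__(self, other):
        return LazyExpr()

    def __bool__(self):
        raise ValueError("Cannot convert column into bool")


def eager_range_check(q):
    """Imitates the failing line of the traceback: cond1 = (0 < q) & (q < 1)."""
    q = np.asarray(q)
    return (0 < q) & (q < 1)


# Plain numbers work: NumPy can evaluate the comparison immediately.
print(eager_range_check([0.31, 0.01, 1.5]))

# A lazy expression fails: NumPy asks for a real True/False per element.
try:
    eager_range_check(LazyExpr())
except ValueError as exc:
    print("ValueError:", exc)
```

This is why the scipy call has to run inside a UDF, where it receives actual numbers rather than a column expression.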
Setting up the DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
a = [
(0.31, .3, .4, .6, 0.4),
(.01, .2, .92, .4, .47),
(.3, .1, .05, .2, .82),
(.4, .4, .3, .6, .15),
]
b = ["column1", "column2", "column3", "column4", "column5"]
df = spark.createDataFrame(a, b)
df.show()
Out:
+-------+-------+-------+-------+-------+
|column1|column2|column3|column4|column5|
+-------+-------+-------+-------+-------+
| 0.31| 0.3| 0.4| 0.6| 0.4|
| 0.01| 0.2| 0.92| 0.4| 0.47|
| 0.3| 0.1| 0.05| 0.2| 0.82|
| 0.4| 0.4| 0.3| 0.6| 0.15|
+-------+-------+-------+-------+-------+
You can use pandas_udf to vectorize the computation:
import pandas as pd
from scipy.stats import norm
from pyspark.sql.functions import pandas_udf
@pandas_udf('double')
def vectorized_ppf(x):
return pd.Series(norm.ppf(x))
df.withColumn('ppf', vectorized_ppf('column1')).show()
Out:
+-------+-------+-------+-------+-------+-------------------+
|column1|column2|column3|column4|column5| ppf|
+-------+-------+-------+-------+-------+-------------------+
| 0.31| 0.3| 0.4| 0.6| 0.4|-0.4958503473474533|
| 0.01| 0.2| 0.92| 0.4| 0.47|-2.3263478740408408|
| 0.3| 0.1| 0.05| 0.2| 0.82|-0.5244005127080409|
| 0.4| 0.4| 0.3| 0.6| 0.15|-0.2533471031357997|
+-------+-------+-------+-------+-------+-------------------+
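The same idea should extend to the full expression from the question, since a pandas_udf can take several columns at once. Below is a sketch under a few assumptions: the formula is copied as written in the question (including `column1 * column1`), the names `ppf_ratio` and `add_new_column` are hypothetical, and the `pd.Series` type hints are the Spark 3 style of declaring a Series-to-Series UDF. The pyspark import sits inside the wrapper so the numeric part can be tried on plain pandas Series first.

```python
import pandas as pd
from scipy.stats import norm


def ppf_ratio(c1: pd.Series, c2: pd.Series) -> pd.Series:
    """(ppf(c1) - ppf(c1 * c1)) / sqrt(1 - c2), computed on a whole batch."""
    # norm.ppf receives real numbers here, so its internal checks work.
    values = (norm.ppf(c1) - norm.ppf(c1 * c1)) / (1 - c2) ** 0.5
    return pd.Series(values)


def add_new_column(df):
    """Attach the result as 'new_column' (assumes columns column1, column2)."""
    # Imported lazily so ppf_ratio can be tested without a Spark install.
    from pyspark.sql.functions import pandas_udf

    ratio_udf = pandas_udf(ppf_ratio, 'double')
    return df.withColumn('new_column', ratio_udf('column1', 'column2'))
```

With the DataFrame from above, `add_new_column(df).show()` would display the result. Inputs outside the open interval (0, 1) for `column1`, or values of `column2` at or above 1, will produce NaN or infinite entries rather than an error.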
udf when pandas_udf is not available
Sometimes it is hard to get pandas_udf to work properly. You can use udf as an alternative.
Define the scipy function as a udf
from scipy.stats import norm
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
@F.udf(DoubleType())
def ppf(x):
return float(norm.ppf(x))
Call ppf on the values of column1 to create new_column
df1 = df.withColumn('new_column' , ppf('column1'))
df1.show()
Out:
+-------+-------+-------+-------+-------+-------------------+
|column1|column2|column3|column4|column5| new_column|
+-------+-------+-------+-------+-------+-------------------+
| 0.31| 0.3| 0.4| 0.6| 0.4|-0.4958503473474533|
| 0.01| 0.2| 0.92| 0.4| 0.47|-2.3263478740408408|
| 0.3| 0.1| 0.05| 0.2| 0.82|-0.5244005127080409|
| 0.4| 0.4| 0.3| 0.6| 0.15|-0.2533471031357997|
+-------+-------+-------+-------+-------+-------------------+
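A plain udf can also cover the whole expression from the question by taking two arguments. This is only a sketch: `ppf_ratio_row` and `add_new_column_rowwise` are hypothetical names, and returning NULL for missing inputs or for a second column at or above 1 is my own assumption about how such rows should be handled, not something stated in the question.

```python
from scipy.stats import norm


def ppf_ratio_row(c1, c2):
    """Row-at-a-time (ppf(c1) - ppf(c1 * c1)) / sqrt(1 - c2)."""
    # Spark passes SQL NULL as None. Assumption: such rows, and rows where
    # the square root is undefined or zero, should yield NULL.
    if c1 is None or c2 is None or c2 >= 1:
        return None
    # float() turns the NumPy scalar into a plain Python float for DoubleType.
    return float((norm.ppf(c1) - norm.ppf(c1 * c1)) / (1 - c2) ** 0.5)


def add_new_column_rowwise(df):
    """Attach the result as 'new_column' (assumes columns column1, column2)."""
    # Imported lazily so ppf_ratio_row can be tested without a Spark install.
    import pyspark.sql.functions as F
    from pyspark.sql.types import DoubleType

    ratio_udf = F.udf(ppf_ratio_row, DoubleType())
    return df.withColumn('new_column', ratio_udf('column1', 'column2'))
```

Calling `add_new_column_rowwise(df).show()` on the DataFrame above would then show the new column next to the original five.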
I ran pandas_udf (vectorized) and udf on different input sizes.
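A comparison like this could be set up roughly as follows. This is a sketch, not the exact script I used: `best_time` and `compare_udfs` are hypothetical helper names, the input sizes are arbitrary, and aggregating with `F.max` is just one assumed way of forcing Spark to evaluate the UDF on every row.

```python
import time


def best_time(action, repeats=3):
    """Call action() several times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_udfs(spark, udfs, sizes=(1_000, 100_000)):
    """Time each UDF in `udfs` (a dict of name -> UDF) on random inputs."""
    # Imported lazily so best_time can be used without a Spark install.
    import pyspark.sql.functions as F

    results = {}
    for n in sizes:
        # One column of uniform random probabilities, cached and materialized
        # so that data generation is not counted in the timings.
        data = spark.range(n).select(F.rand(seed=42).alias('p')).cache()
        data.count()
        for name, fn in udfs.items():
            # The aggregate needs every UDF result, so the UDF runs on all rows.
            results[(name, n)] = best_time(
                lambda: data.select(F.max(fn('p'))).collect()
            )
        data.unpersist()
    return results
```

With the two functions defined earlier, `compare_udfs(spark, {'pandas_udf': vectorized_ppf, 'udf': ppf})` would return a dict keyed by `(name, size)`. Timings depend heavily on the cluster and on whether Arrow is available, so no numbers are quoted here.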