
Pyspark dataframe convert multiple columns to float

I am trying to convert multiple columns of a dataframe from string to float like this:

df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()

but I am getting the error:

select() argument after * must be a sequence, not generator

I cannot understand why this error is being thrown.

float() is not a Spark function; you need the cast() function:

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
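
As a quick check (a minimal sketch reusing the df_temp from the question), the schema should now report float for every column:

from pyspark.sql.functions import col

df_casted = df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
df_casted.printSchema()   # every column should now be reported as float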

If you want to cast some columns without changing the whole dataframe, you can do that with the withColumn function:

for col_name in cols:   # cols: list of the column names you want to cast
    df = df.withColumn(col_name, col(col_name).cast('float'))

This will cast the type of the columns in the cols list and keep the other columns as they are.
Note:
withColumn replaces or creates a column based on the column name;
if the column name already exists it will be replaced, otherwise a new column will be created.
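
For example (a minimal sketch on the df_temp from the question), casting to an existing name replaces the column in place, while a new name appends a column:

from pyspark.sql.functions import col

# "x" already exists, so its string values are replaced by the float version
df_temp.withColumn("x", col("x").cast("float")).printSchema()

# "x_float" does not exist yet, so a new column is appended at the end
df_temp.withColumn("x_float", col("x").cast("float")).printSchema()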

If you want to cast multiple columns to float and keep the other columns the same, you can use a single select statement.

from pyspark.sql.functions import col

columns_to_cast = ["col1", "col2", "col3"]
df_temp = (
   df
   .select(
     *(c for c in df.columns if c not in columns_to_cast),
     *(col(c).cast("float").alias(c) for c in columns_to_cast)
   )
)
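
As a concrete usage sketch (reusing the original df_temp from the question and assuming we keep x as a string while casting y and z), the same pattern looks like this; note that the untouched columns are listed first, so the resulting column order can differ from the original:

from pyspark.sql.functions import col

columns_to_cast = ["y", "z"]
df_partial = df_temp.select(
    *(c for c in df_temp.columns if c not in columns_to_cast),
    *(col(c).cast("float").alias(c) for c in columns_to_cast)
)
df_partial.printSchema()   # x stays a string, y and z are now floats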

I saw the withColumn answer, which will work, but since Spark dataframes are immutable, each withColumn call generates a completely new dataframe.

Here is another approach on how to do it:

cv = []   # list of columns you want to convert to float
cf = []   # list of columns you don't want to change

# build a string such as "(float(x.col1),float(x.col2))" that is eval'ed for every row
l = ['float(x.' + c + ')' for c in cv]
cst = '(' + ','.join(l) + ')'

# same for the untouched columns: "(x.col3,x.col4)"
l2 = ['x.' + c for c in cf]
cst2 = '(' + ','.join(l2) + ')'

# go through the underlying RDD (DataFrames no longer expose .map directly);
# each row is rebuilt as a tuple of the unchanged columns followed by the converted ones
df2rdd = df.rdd.map(lambda x: eval(cst2) + eval(cst))

# this assumes df.columns lists the unchanged columns first, then the converted ones
df_output = sqlContext.createDataFrame(df2rdd, df.columns)

df_output is your required dataframe.
