
Pyspark dataframe convert multiple columns to float

I am trying to convert multiple columns of a dataframe from string to float like this:

df_temp = sc.parallelize([("1", "2", "3.4555"), ("5.6", "6.7", "7.8")]).toDF(("x", "y", "z"))
df_temp.select(*(float(col(c)).alias(c) for c in df_temp.columns)).show()

but I am getting the error:

select() argument after * must be a sequence, not generator

I cannot understand why this error is being thrown.

float() is not a Spark function; you need the cast() function:

from pyspark.sql.functions import col
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
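
As a quick check (a minimal sketch reusing the df_temp from the question), the schema should now report float for every column:

from pyspark.sql.functions import col

df_casted = df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
df_casted.printSchema()   # every column should now be reported as float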

If you want to cast some columns without changing the whole dataframe, you can do that with the withColumn function:

for col_name in cols:   # cols: list of the column names you want to cast
    df = df.withColumn(col_name, col(col_name).cast('float'))

This will cast the type of the columns in the cols list and keep the other columns as they are.
Note:
withColumn replaces or creates a column based on the column name;
if the column name already exists it will be replaced, otherwise a new column will be created.
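
For example (a minimal sketch on the df_temp from the question), casting to an existing name replaces the column in place, while a new name appends a column:

from pyspark.sql.functions import col

# "x" already exists, so its string values are replaced by the float version
df_temp.withColumn("x", col("x").cast("float")).printSchema()

# "x_float" does not exist yet, so a new column is appended at the end
df_temp.withColumn("x_float", col("x").cast("float")).printSchema()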

If you want to cast multiple columns to float and keep the other columns the same, you can use a single select statement.

from pyspark.sql.functions import col

columns_to_cast = ["col1", "col2", "col3"]
df_temp = (
   df
   .select(
     *(c for c in df.columns if c not in columns_to_cast),
     *(col(c).cast("float").alias(c) for c in columns_to_cast)
   )
)
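
As a concrete usage sketch (reusing the original df_temp from the question and assuming we keep x as a string while casting y and z), the same pattern looks like this; note that the untouched columns are listed first, so the resulting column order can differ from the original:

from pyspark.sql.functions import col

columns_to_cast = ["y", "z"]
df_partial = df_temp.select(
    *(c for c in df_temp.columns if c not in columns_to_cast),
    *(col(c).cast("float").alias(c) for c in columns_to_cast)
)
df_partial.printSchema()   # x stays a string, y and z are now floats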

I saw the withColumn answer, which will work, but since Spark dataframes are immutable, each withColumn call generates a completely new dataframe.

Here is another approach on how to do it:

cv = []   # list of columns you want to convert to float
cf = []   # list of columns you don't want to change

# build a string such as "(float(x.col1),float(x.col2))" that is eval'ed for every row
l = ['float(x.' + c + ')' for c in cv]
cst = '(' + ','.join(l) + ')'

# same for the untouched columns: "(x.col3,x.col4)"
l2 = ['x.' + c for c in cf]
cst2 = '(' + ','.join(l2) + ')'

# go through the underlying RDD (DataFrames no longer expose .map directly);
# each row is rebuilt as a tuple of the unchanged columns followed by the converted ones
df2rdd = df.rdd.map(lambda x: eval(cst2) + eval(cst))

# this assumes df.columns lists the unchanged columns first, then the converted ones
df_output = sqlContext.createDataFrame(df2rdd, df.columns)

df_output is your required dataframe.
