
Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated with ",". The number of columns is not the same each time I run the program.

I read the file as a dataframe:

var df = spark.read.csv(originalPath);

but when I print the schema, it shows all the columns as strings.

I convert all columns to integers as below, but when I print the schema of df again afterwards, the columns are still strings.

df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));

I appreciate any help solving this casting issue.

Thanks.

DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it.
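The immutability point can be seen without Spark at all. Below is a minimal plain-Scala sketch (the Frame class and its withColType method are hypothetical stand-ins for DataFrame and withColumn, invented for illustration): foreach throws each intermediate result away, while foldLeft threads it through.

```scala
// Hypothetical stand-in for an immutable DataFrame: withColType returns
// a NEW Frame, just as DataFrame.withColumn returns a new DataFrame.
case class Frame(cols: Map[String, String]) {
  def withColType(name: String, tpe: String): Frame =
    Frame(cols + (name -> tpe))
}

val start = Frame(Map("_c0" -> "string", "_c1" -> "string"))

// foreach: every withColType result is discarded, so `start` never changes.
start.cols.keys.foreach(c => start.withColType(c, "integer"))

// foldLeft: each result becomes the accumulator for the next step.
val casted = start.cols.keys
  .foldLeft(start)((f, c) => f.withColType(c, "integer"))
```

After this runs, start still maps every column to "string", while casted maps every column to "integer".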

It is best to use map and select:

val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)

but you could use foldLeft:

df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))

or even (please don't) a mutable reference:

var df = Seq(("1", "2", "3")).toDF

df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))

Or, since the number of columns is not the same each time as you mentioned, you could take the highest possible number of columns and build a schema from it, with IntegerType as the column type. Apply this schema when loading the file to automatically read the dataframe columns as integers instead of strings. No explicit conversion is required in this case.

import org.apache.spark.sql.types._

val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))

val df = spark.read.schema(csvSchema).csv(originalPath)

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
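Since the column count varies between runs, the fixed four-field schema above can be generated instead. One way is to build a DDL schema string, which DataFrameReader.schema also accepts (since Spark 2.3). A minimal sketch, where maxCols is a hypothetical upper bound you would choose for your data:

```scala
// Build a DDL schema string for up to maxCols integer columns,
// matching Spark's default CSV column names _c0, _c1, ...
def integerSchemaDDL(maxCols: Int): String =
  (0 until maxCols).map(i => s"_c$i INT").mkString(", ")

val ddl = integerSchemaDDL(4)
// ddl == "_c0 INT, _c1 INT, _c2 INT, _c3 INT"
// Then load with:  spark.read.schema(ddl).csv(originalPath)
```

Rows with fewer than maxCols columns simply get null in the trailing fields, since each StructField is nullable by default.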
