
Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated with ",". The number of columns is not the same each time I run the program.

I read the file as a dataframe:

var df = spark.read.csv(originalPath);

but when I print the schema, it shows all the columns as strings.

I convert all columns to integers as below, but when I print the schema of df again afterwards, the columns are still strings.

df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));

I appreciate any help solving this casting issue.

Thanks.

DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it.
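The immutability point can be seen without Spark at all. Below is a minimal plain-Scala sketch (the Frame class and its withColType method are hypothetical stand-ins for DataFrame and withColumn, invented for illustration): foreach throws each intermediate result away, while foldLeft threads it through.

```scala
// Hypothetical stand-in for an immutable DataFrame: withColType returns
// a NEW Frame, just as DataFrame.withColumn returns a new DataFrame.
case class Frame(cols: Map[String, String]) {
  def withColType(name: String, tpe: String): Frame =
    Frame(cols + (name -> tpe))
}

val start = Frame(Map("_c0" -> "string", "_c1" -> "string"))

// foreach: every withColType result is discarded, so `start` never changes.
start.cols.keys.foreach(c => start.withColType(c, "integer"))

// foldLeft: each result becomes the accumulator for the next step.
val casted = start.cols.keys
  .foldLeft(start)((f, c) => f.withColType(c, "integer"))
```

After this runs, start still maps every column to "string", while casted maps every column to "integer".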

It is best to use map and select:

val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)

but you could use foldLeft:

df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))

or even (please don't) a mutable reference:

var df = Seq(("1", "2", "3")).toDF

df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))

Or, since the number of columns is not the same each time as you mentioned, you could take the highest possible number of columns and build a schema from it, with IntegerType as the column type. Apply this schema when loading the file to automatically read the dataframe columns as integers instead of strings. No explicit conversion is required in this case.

import org.apache.spark.sql.types._

val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))

val df = spark.read.schema(csvSchema).csv(originalPath)

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
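Since the column count varies between runs, the fixed four-field schema above can be generated instead. One way is to build a DDL schema string, which DataFrameReader.schema also accepts (since Spark 2.3). A minimal sketch, where maxCols is a hypothetical upper bound you would choose for your data:

```scala
// Build a DDL schema string for up to maxCols integer columns,
// matching Spark's default CSV column names _c0, _c1, ...
def integerSchemaDDL(maxCols: Int): String =
  (0 until maxCols).map(i => s"_c$i INT").mkString(", ")

val ddl = integerSchemaDDL(4)
// ddl == "_c0 INT, _c1 INT, _c2 INT, _c3 INT"
// Then load with:  spark.read.schema(ddl).csv(originalPath)
```

Rows with fewer than maxCols columns simply get null in the trailing fields, since each StructField is nullable by default.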
