
Spark: cast a column to a SQL type stored in a string

My request is simple: I need help adding a column to a DataFrame. The column has to be empty (all nulls), its type comes from ...spark.sql.types, and the type has to be defined from a string.

I could probably do this with ifs or a case, but I'm looking for something more elegant: something that does not require writing a case for every type in org.apache.spark.sql.types.

For example, if I do this:

df = df.withColumn("col_name", lit(null).cast(org.apache.spark.sql.types.StringType))

it works as intended, but I have the type stored as a string,

var the_type = "StringType"

or

var the_type = "org.apache.spark.sql.types.StringType"

and I can't get the cast to work when the type is defined from that string.

For those interested, here are some more details: I have a set of (col_name, col_type) tuples, both stored as strings, and I need to add columns with the correct types so that I can later union two DataFrames.

I currently have this:

for (i <- set_of_col_type_tuples) yield {
    // the_type is the value I am trying to build from the string i._2
    val the_type = Class.forName("org.apache.spark.sql.types."+i._2)
    df = df.withColumn(i._1, lit(null).cast(the_type))
    df
}

If I use

val the_type = Class.forName("org.apache.spark.sql.types."+i._2)

I get

error: overloaded method value cast with alternatives:
  (to: String)org.apache.spark.sql.Column <and>
  (to: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column
 cannot be applied to (Class[?0])

If I use

val the_type = Class.forName("org.apache.spark.sql.types."+i._2).getName()

the value is a string, so I get:

org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '.' expecting {<EOF>, '('}(line 1, pos 3)
== SQL ==
org.apache.spark.sql.types.StringType
---^^^

EDIT: To be clear, the set contains tuples like ("col1","IntegerType"), ("col2","StringType"), not ("col1","int"), ("col2","string"), so a simple cast(i._2) does not work.

Thank you.

You can use the overloaded cast method that takes a String argument:

val stringType : String = ...
column.cast(stringType)

def cast(to: String): Column

Casts the column to a different data type, using the canonical string representation of the type.

Note that the canonical representation is a name such as string or int, not a class name such as StringType.
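If the stored names always follow the StringType / IntegerType pattern, one possible way to keep using cast(to: String) is to derive the canonical name from the class-style name. The sketch below is a hypothetical helper (TypeNames.toCastName is not part of Spark). It rests on an assumption: that stripping the Type suffix and lower-casing yields a name the Spark SQL type parser accepts. That should hold for the simple atomic types (string, integer, long, double, boolean, date, timestamp), but types such as NullType or CalendarIntervalType may not map this way and would need special handling.

```scala
import java.util.Locale

object TypeNames {
  // Hypothetical helper: "StringType" -> "string", "IntegerType" -> "integer".
  // Assumption: the result is a type name that Column.cast(to: String) can parse.
  def toCastName(typeName: String): String =
    typeName
      .split('.').last          // tolerate a fully qualified class name
      .stripSuffix("Type")      // drop the trailing "Type"
      .toLowerCase(Locale.ROOT) // locale-independent lower-casing
}
```

With such a helper, the loop body from the question could become df.withColumn(i._1, lit(null).cast(TypeNames.toCastName(i._2))).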

You can also scan for all data types:

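import org.apache.spark.sql.types.{DataType, DataTypes}

// Collect the values of the static fields of DataTypes (StringType, IntegerType, ...)
val types = classOf[DataTypes]
    .getDeclaredFields()
    .filter(f => java.lang.reflect.Modifier.isStatic(f.getModifiers()))
    .map(f => f.get(null).asInstanceOf[DataType]) // static field, so no instance is needed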

Now types is an Array[DataType]. You can turn it into a Map keyed by type name:

val typeMap = types.map(t => (t.getClass.getSimpleName.replace("$", ""), t)).toMap

and use it in your code:

column.cast(typeMap(yourType))
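Applied to the case in the question, the lookup could be wrapped up as in the following minimal sketch. AddTypedNullColumns and addNullColumns are hypothetical names, not Spark APIs. The sketch assumes the stored names match the static field names of DataTypes (StringType, IntegerType, ...), so it keys the map by field name, and it folds over the tuples instead of reassigning a var. Parameterised types such as DecimalType are not static fields of DataTypes and are not covered here.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, DataTypes}

object AddTypedNullColumns {
  // Static DataType fields of the Java class DataTypes, keyed by field name.
  val typeMap: Map[String, DataType] =
    classOf[DataTypes].getDeclaredFields
      .filter(f => java.lang.reflect.Modifier.isStatic(f.getModifiers))
      .filter(f => classOf[DataType].isAssignableFrom(f.getType))
      .map(f => f.getName -> f.get(null).asInstanceOf[DataType])
      .toMap

  // Adds one all-null column per (columnName, typeName) pair.
  // Throws NoSuchElementException if a type name is not in typeMap.
  def addNullColumns(df: DataFrame, cols: Set[(String, String)]): DataFrame =
    cols.foldLeft(df) { case (acc, (name, typeName)) =>
      acc.withColumn(name, lit(null).cast(typeMap(typeName)))
    }
}
```

Usage would then be a single call such as AddTypedNullColumns.addNullColumns(df, set_of_col_type_tuples), which returns a new DataFrame and leaves df itself unchanged.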
