
Spark Scala: Cast StructType to String

I read JSON as:

val df = spark.read.json(rdd)

I read messages from different topics, so I cannot specify an explicit schema. Some messages contain fields with nested JSON, and those are converted to StructType. For example:

{"name": "John", "son": {"name":"Tom"}}

How can I cast it to a String? I need to read the "son" field as a String:

"{\"name\":\"Tom\"}"

Using the cast method or the SQL function fails:

df.selectExpr("cast(son as string)")

Error:

java.lang.String is not a valid external type for schema of struct<name:string>

You can easily get it back as a string using to_json:

df.select(to_json(df("son")))
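For completeness, a minimal self-contained sketch of that approach (the alias and the outputs shown in comments are my assumptions, not from the original answer); to_json comes from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.to_json

// Serialize the nested struct back into its JSON text representation.
val sonAsString = df.select(to_json(df("son")).alias("son"))

sonAsString.printSchema()
// root
//  |-- son: string (nullable = true)

sonAsString.show(false)
// +--------------+
// |son           |
// +--------------+
// |{"name":"Tom"}|
// +--------------+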

Sorry, I misunderstood your question. I thought you had different schemas, that the field sometimes came back as a struct and sometimes as a string, and that you wanted to transform it to a string every time. I'll leave the answer here for information purposes.


I tried a small test case locally, and apparently if I let Spark infer the schema, it considers my "son" field to be a String. I don't know how you build your processing logic, but as a workaround you could try to specify a schema manually and type "son" as a String?

import java.io.File

import org.apache.commons.io.FileUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val testDataset =
  """
    | {"name": "John", "son": {"name":"Tom"}}
    | {"name": "John", "son": "Tom"}
  """.stripMargin
val testJsonFile = new File("./test_json.json")
FileUtils.writeStringToFile(testJsonFile, testDataset)


// Declare "son" as a plain String even though some rows contain nested JSON
val schema = StructType(
  Seq(StructField("name", DataTypes.StringType, true), StructField("son", DataTypes.StringType, true))
)
val sparkSession = SparkSession.builder()
    .appName("Test inconsistent field type").master("local[*]").getOrCreate()
val structuredJsonData = sparkSession.read.schema(schema).json(testJsonFile.getAbsolutePath)
import sparkSession.implicits._

val collectedDataset = structuredJsonData.map(row => row.getAs[String]("son")).collect()
println(s"got=${collectedDataset.mkString("---")}")
structuredJsonData.printSchema()

It prints:

got={"name":"Tom"}---Tom
root
 |-- name: string (nullable = true)
 |-- son: string (nullable = true)
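If you go with the string-typed schema and later need the nested fields, you can still parse the string column on demand with from_json. A hedged sketch (the nested schema below is my assumption about the payload; values that aren't valid JSON for it, like the plain "Tom" line, would just come back as null):

import org.apache.spark.sql.functions.from_json

// Parse the raw JSON string back into a struct only where it is needed.
// Reuses the types imported above and the implicits for the $ syntax.
val sonSchema = StructType(Seq(StructField("name", DataTypes.StringType, true)))
val parsed = structuredJsonData.select(from_json($"son", sonSchema).alias("son_struct"))
parsed.printSchema()
// root
//  |-- son_struct: struct (nullable = true)
//  |    |-- name: string (nullable = true)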

You could still try to define a custom mapping function. However, I'm not sure it will work, because when I try to apply a schema with a StructType to JSON lines that contain a StringType instead, the whole line is ignored (null values in both fields):

val testDataset =
  """
    | {"name": "John", "son": {"name":"Tom"}}
    | {"name": "John", "son": "Tom2"}
  """.stripMargin
val testJsonFile = new File("./test_json.json")
FileUtils.writeStringToFile(testJsonFile, testDataset)

val schema = StructType(
  Seq(
    StructField("name", DataTypes.StringType, true),
    StructField("son", StructType(Seq(StructField("name", DataTypes.StringType, true))))
  )
)
val sparkSession = SparkSession.builder()
    .appName("Test inconsistent field type").master("local[*]").getOrCreate()
val structuredJsonData = sparkSession.read.schema(schema).json(testJsonFile.getAbsolutePath)
println(s"got=${structuredJsonData.collect().mkString("---")}")
structuredJsonData.printSchema()

It prints:

got=[John,[Tom]]---[null,null]
root
 |-- name: string (nullable = true)
 |-- son: struct (nullable = true)
 |    |-- name: string (nullable = true)
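Since the question reads from different topics and cannot fix the schema up front, one more option (my own untested sketch, combining the to_json answer above with a runtime type check; normalizeSon is a hypothetical helper name) is to normalize the column after inference:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.types.StructType

// Normalize "son" to a string column regardless of what Spark inferred:
// structs get serialized with to_json, anything else is cast directly.
def normalizeSon(df: DataFrame): DataFrame =
  df.schema("son").dataType match {
    case _: StructType => df.withColumn("son", to_json(col("son")))
    case _             => df.withColumn("son", col("son").cast("string"))
  }

// Usage with the DataFrame from the question:
val normalized = normalizeSon(spark.read.json(rdd))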
