Apache Spark to_json options parameter

I either don't know what I'm looking for or the documentation is lacking. The latter seems to be the case, given this:

http://spark.apache.org/docs/2.2.2/api/java/org/apache/spark/sql/functions.html#to_json-org.apache.spark.sql.Column-java.util.Map-

"options - options to control how the struct column is converted into a json string. accepts the same options and the json data source."

Great! So, what are my options?

I'm doing something like this:

Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", to_json(struct("record_count")));

...and I get this result:

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": "{\"record_count\": 989}"
}

I'd like this instead (no backslashes, and no quotes around the value of "data"):

{
  "id": "ABC123",
  "timestamp": "2018-11-16 20:40:26.108",
  "data": {"record_count": 989}
}

Is this one of the options, by chance? Is there a better guide out there for Spark? The most frustrating part about Spark hasn't been getting it to do what I want; it's been the lack of good information on what it can do.

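You are JSON-encoding the record_count field twice: to_json first turns the struct into a JSON string, and that string is then escaped again when the row itself is written out as JSON. Remove to_json; struct alone should be sufficient.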

That is, change your code to something like this:

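import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.struct;

// Keep "data" as a nested struct so it is serialized as a JSON object, not a string.
Dataset<Row> formattedReader = reader
    .withColumn("id", lit(id))
    .withColumn("timestamp", lit(timestamp))
    .withColumn("data", struct("record_count"));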
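
To see why the backslashes appear in the first place, here is a minimal sketch that does not use Spark at all. It only mimics the two situations: a "data" field that already holds a serialized JSON string, versus one that holds the nested structure itself. The class name and all helper methods are hypothetical, and the escaping is hand-rolled and covers only backslashes and double quotes, which is enough for this illustration but not a general JSON encoder.

```java
public class DoubleEncodingDemo {
    // Hypothetical helper: wraps text as a JSON string literal,
    // escaping only backslashes and double quotes (enough for this demo).
    static String toJsonString(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Stands in for to_json(struct("record_count")): the struct becomes a string.
    static String encodeRecord(int recordCount) {
        return "{\"record_count\":" + recordCount + "}";
    }

    // "data" holds an already-encoded string, so it must be quoted and escaped again.
    static String withStringData(String id, int recordCount) {
        return "{\"id\":" + toJsonString(id)
            + ",\"data\":" + toJsonString(encodeRecord(recordCount)) + "}";
    }

    // "data" holds the nested structure itself, so it is emitted as an object.
    static String withNestedData(String id, int recordCount) {
        return "{\"id\":" + toJsonString(id)
            + ",\"data\":" + encodeRecord(recordCount) + "}";
    }

    public static void main(String[] args) {
        System.out.println(withStringData("ABC123", 989));
        System.out.println(withNestedData("ABC123", 989));
    }
}
```

Running main prints the escaped form first and the nested form second, which mirrors the two outputs shown in the question: the first corresponds to calling to_json on the struct, the second to passing the struct through unchanged.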

