Java Spark: How to get value from a column which is JSON formatted string for entire dataset?

I need some help here. I am reading data from Hive/CSV. One column has type string, and its value is a JSON-formatted string, something like this:

|                      Column Name A                       |
|----------------------------------------------------------|
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|

How can I get the value of key_2 and insert it into a new column?

I tried to create a new function to get the value via Gson:

// Requires com.google.gson.JsonObject and com.google.gson.JsonParser.
private BigDecimal getValue(final String columnValue) {
    JsonObject jsonObject = JsonParser.parseString(columnValue).getAsJsonObject();
    // key_2 sits directly under key.data, next to key_1.
    return jsonObject.get("key").getAsJsonObject().get("data").getAsJsonObject().get("key_2").getAsJsonArray().get(0).getAsBigDecimal();
}

But how can I apply this method to the whole dataset?

I was trying to achieve something like this:

Dataset<Row> ds = sourceDataSet.withColumn("New_column", getValue(sourceDataSet.col("Column Name A")));

But this cannot be done because the data types are different (getValue takes a String, but col(...) returns a Column)...

Could you please give any suggestions?

Thanks!

------------------Update---------------------

As @Mck suggested, I used get_json_object. My value is wrapped in double quotes ("):

"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"

I used substring to remove the " characters, so the new string looks like this:

{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}

Code for the substring step:

Dataset<Row> dsA = sourceDataSet.withColumn("Column Name A", expr("substring(`Column Name A`, 2, length(`Column Name A`))"));

I used dsA.show() and confirmed the dataset looks correct.

Then I tried the following code:

Dataset<Row> ds = dsA.withColumn("New_column",get_json_object(dsA.col("Column Name A"), "$.key.data.key_2[0]"));

which returns null.

However, if the data is this:

{"key":{"data":{"key_2":[456]}}}

I can get value 456.

Any suggestions as to why I get null? Thanks for the help!

Use get_json_object:

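// Java; assumes: import static org.apache.spark.sql.functions.*;
ds.withColumn(
    "New_column",
    get_json_object(
        // Start at position 2 and take length - 2 characters to drop both wrapping quotes.
        col("Column Name A").substr(lit(2), length(col("Column Name A")).minus(2)),
        "$.key.data.key_2[0]")
).show(false);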

+----------------------------------------------------------+----------+
|Column Name A                                             |New_column|
+----------------------------------------------------------+----------+
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|456       |
+----------------------------------------------------------+----------+
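The substr call above takes a 1-based start position and a length, so starting at 2 and taking length minus 2 characters drops both the leading and the trailing quote. Passing the full length instead, as in the update to the question, only drops the leading quote. Here is a minimal plain-Java sketch of that arithmetic. The helper sqlSubstring is hypothetical and only imitates SQL-style substring for positive start positions; it is not a Spark API.

```java
public class SubstrDemo {
    // Hypothetical helper (not part of Spark): imitates SQL substring(str, pos, len)
    // for pos >= 1, where pos is 1-based and len is a character count.
    static String sqlSubstring(String s, int pos, int len) {
        int start = Math.min(pos - 1, s.length());
        int end = Math.min(start + Math.max(len, 0), s.length());
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // The sample value from the question, including the wrapping quotes.
        String raw = "\"{\"key\":{\"data\":{\"key_1\":{\"key_A\":[123]},\"key_2\":[456]}}}\"";

        // Start at 2, take length - 2 characters: both quotes are removed.
        System.out.println(sqlSubstring(raw, 2, raw.length() - 2));

        // Start at 2, take length characters: the trailing quote is still there.
        System.out.println(sqlSubstring(raw, 2, raw.length()));
    }
}
```

The first line printed is the bare JSON object; the second still ends with a stray " character after the closing brace.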

