Spark - convert JSON array object to array of string

One of the columns in my DataFrame holds data in the following form:

[{"text":"Tea"},{"text":"GoldenGlobes"}]

I want to convert it to a plain array of strings:

["Tea", "GoldenGlobes"]

Could someone please tell me how to do this?

See the example below, which does not use a UDF:

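import pyspark.sql.functions as f
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Create (or reuse) a session instead of importing one from pyspark.shell
spark = SparkSession.builder.getOrCreate()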

df = spark.createDataFrame([
    Row(values='[{"text":"Tea"},{"text":"GoldenGlobes"}]'),
    Row(values='[{"text":"GoldenGlobes"}]')
])

schema = ArrayType(StructType([
    StructField('text', StringType())
]))

df \
    .withColumn('array_of_str', f.from_json(f.col('values'), schema).text) \
    .show()

Output:

+--------------------+-------------------+
|              values|       array_of_str|
+--------------------+-------------------+
|[{"text":"Tea"},{...|[Tea, GoldenGlobes]|
|[{"text":"GoldenG...|     [GoldenGlobes]|
+--------------------+-------------------+
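
To check the expected result locally without a Spark session, here is a minimal plain-Python sketch of the same per-row transformation. The function name `extract_texts` is made up for illustration, and returning `None` for unparseable input is an assumption chosen to resemble a null result, not a guarantee about how Spark behaves in every mode.

```python
import json


def extract_texts(json_array_str):
    """Return the "text" value of every object in a JSON array string."""
    try:
        items = json.loads(json_array_str)
    except (TypeError, ValueError):
        # Missing or unparseable input: give back None (assumed null-like result)
        return None
    if not isinstance(items, list):
        return None
    # Objects without a "text" key contribute None to the result
    return [item.get("text") if isinstance(item, dict) else None for item in items]


rows = ['[{"text":"Tea"},{"text":"GoldenGlobes"}]', '[{"text":"GoldenGlobes"}]']
result = [extract_texts(r) for r in rows]
print(result)
```

This is only meant for sanity-checking sample values; on a real DataFrame the `from_json` approach above keeps the work inside Spark.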

If your column is of array type with a known, fixed length, something like this should work (not tested):

from pyspark.sql import functions as F
from pyspark.sql import types as T

c = F.array([F.get_json_object(F.col("colname")[0], '$.text'),
             F.get_json_object(F.col("colname")[1], '$.text')])

df = df.withColumn("new_col", c)

Or, if the length is not fixed (I do not see a solution without a UDF):

# Assumes each element is a struct (or map) with a "text" field
@F.udf(T.ArrayType(T.StringType()))
def get_list(x):
    o_list = []
    for elt in x:
        o_list.append(elt["text"])
    return o_list

df = df.withColumn("new_col", get_list("colname"))
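
Note that the UDF above indexes `elt["text"]`, which only works when each array element is a struct or map. If the elements are JSON strings instead (as the `get_json_object` variant assumes), each one has to be parsed first. A minimal sketch of such a function body in plain Python follows; the name `get_texts` and the choice to emit `None` for bad elements are assumptions for illustration.

```python
import json


def get_texts(json_strings):
    """Extract "text" from each JSON object string in a list of any length."""
    if json_strings is None:
        return None
    texts = []
    for s in json_strings:
        try:
            texts.append(json.loads(s).get("text"))
        except (TypeError, ValueError, AttributeError):
            # Not valid JSON, or not a JSON object: keep the slot, fill with None
            texts.append(None)
    return texts


sample = ['{"text":"Tea"}', '{"text":"GoldenGlobes"}']
print(get_texts(sample))
```

Wrapped with `F.udf(T.ArrayType(T.StringType()))`, a function like this could presumably be applied to the column the same way as `get_list`.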

Here is the Java equivalent:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.types.DataTypes.StringType;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

Dataset<Row> df = getYourDf();

StructType structschema = DataTypes.createStructType(
        new StructField[] {
                DataTypes.createStructField("text", StringType, true)
        });

ArrayType schema = DataTypes.createArrayType(structschema, true);

df = df.withColumn("array_of_str", from_json(col("colname"), schema).getField("text"));
