
Is there any way to flatten the nested JSON in Spark Streaming?

I have written a Spark Dataset (batch) job to flatten the data, and it works fine. But when I tried to use the same code snippet in a Spark Streaming job, it throws the following error: Queries with streaming sources must be executed with writeStream.start();
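To illustrate the restriction behind this error: Structured Streaming rejects any eager action (show(), count(), collect()) on a streaming Dataset; only a sink started via writeStream may execute the query. A minimal sketch, assuming a SparkSession named spark and a schema matching the input JSON are already in scope:

    // Hypothetical input path; the schema must match the incoming JSON.
    val streamDF = spark.readStream.format("json").schema(schema).load("/tmp/jdata")

    // Any eager action on a streaming Dataset fails:
    // streamDF.show()   // AnalysisException: Queries with streaming sources
    //                   // must be executed with writeStream.start()

    // Transformations are lazy and allowed; only the sink starts execution:
    streamDF.writeStream.format("console").start()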

So is there any way to flatten nested JSON in a streaming job? Sample input nested JSON:

{
   "name":" Akash",
   "age":26,
   "watches":{
      "name":"Apple",
      "models":[
         "Apple Watch Series 5",
         "Apple Watch Nike"
      ]
   },
   "phones":[
      {
         "name":" Apple",
         "models":[
            "iphone X",
            "iphone XR",
            "iphone XS",
            "iphone 11",
            "iphone 11 Pro"
         ]
      },
      {
         "name":" Samsung",
         "models":[
            "Galaxy Note10",
            "Galaxy Note10+",
            "Galaxy S10e",
            "Galaxy S10",
            "Galaxy S10+"
         ]
      },
      {
         "name":" Google",
         "models":[
            "Pixel 3",
            "Pixel 3a"
         ]
      }
   ]
}

Expected output: the flattened table (shown as an image in the original post).

Below is the code snippet:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

private static Dataset<Row> flattenJSONdf(Dataset<Row> ds) {
    StructField[] fields = ds.schema().fields();
    List<String> fieldNames = new ArrayList<>();
    for (StructField s : fields) {
        fieldNames.add(s.name());
    }

    for (StructField field : fields) {
        DataType fieldType = field.dataType();
        String fieldName = field.name();

        if (fieldType instanceof ArrayType) {
            // Keep all other columns as-is and explode the array column in
            // place, then recurse in case the elements are themselves nested.
            List<String> exprs = new ArrayList<>();
            for (String name : fieldNames) {
                if (!fieldName.equals(name)) {
                    exprs.add(name);
                }
            }
            exprs.add(String.format("explode_outer(%s) as %s", fieldName, fieldName));

            Dataset<Row> exploded = ds.selectExpr(exprs.toArray(new String[0]));

            return flattenJSONdf(exploded);

        } else if (fieldType instanceof StructType) {
            // Replace the struct column with one column per child field,
            // renamed from "parent.child" to "parent_child", then recurse.
            List<String> newFieldNames = new ArrayList<>();
            for (String name : fieldNames) {
                if (!fieldName.equals(name)) {
                    newFieldNames.add(name);
                }
            }
            for (String child : ((StructType) fieldType).fieldNames()) {
                newFieldNames.add(fieldName + "." + child);
            }

            List<Column> renamedCols = new ArrayList<>();
            for (String name : newFieldNames) {
                renamedCols.add(new Column(name).as(name.replace(".", "_")));
            }

            Dataset<Row> unnested = ds.select(renamedCols.toArray(new Column[0]));

            return flattenJSONdf(unnested);
        }
    }
    return ds;
}
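
Since flattenJSONdf only chains lazy transformations (selectExpr / select) and never triggers an action, the same method can be applied to a streaming Dataset, provided the result is written out through writeStream. A usage sketch, assuming the method above is made non-private in a hypothetical wrapper class JsonFlattener and streamDF is a streaming Dataset:

    // JsonFlattener is a hypothetical class name for the Java method above;
    // flattenJSONdf must be made non-private to be callable from here.
    val flat = JsonFlattener.flattenJSONdf(streamDF)  // transformations only, no action

    flat.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()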

Note: the code above is in Java, and I have used Spark Structured Streaming.

You can use the org.apache.spark.sql.functions.explode function to flatten array columns. Please check the code below.


scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala>  val schema = DataType.fromJson("""{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"phones","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"models","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]},"containsNull":true},"nullable":true,"metadata":{}},{"name":"watches","type":{"type":"struct","fields":[{"name":"models","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
schema: org.apache.spark.sql.types.StructType = StructType(StructField(age,LongType,true), StructField(name,StringType,true), StructField(phones,ArrayType(StructType(StructField(models,ArrayType(StringType,true),true), StructField(name,StringType,true)),true),true), StructField(watches,StructType(StructField(models,ArrayType(StringType,true),true), StructField(name,StringType,true)),true))

scala> val streamDF = spark.readStream.format("json").schema(schema).load("/tmp/jdata")
streamDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

streamDF
.withColumn("watches_models", explode($"watches.models"))
.withColumn("watches_name", $"watches.name")
.withColumn("phones_models", explode($"phones.models"))
.withColumn("phones_models", explode($"phones_models"))
.withColumn("phones_name", explode($"phones.name"))
.drop("watches", "phones")
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()

// Exiting paste mode, now interpreting.

-------------------------------------------
Batch: 0
-------------------------------------------
+---+------+--------------------+------------+--------------+-----------+
|age|  name|      watches_models|watches_name| phones_models|phones_name|
+---+------+--------------------+------------+--------------+-----------+
| 26| Akash|Apple Watch Series 5|       Apple|      iphone X|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple|      iphone X|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple|      iphone X|     Google|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XR|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XR|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XR|     Google|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XS|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XS|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone XS|     Google|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone 11|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone 11|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple|     iphone 11|     Google|
| 26| Akash|Apple Watch Series 5|       Apple| iphone 11 Pro|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple| iphone 11 Pro|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple| iphone 11 Pro|     Google|
| 26| Akash|Apple Watch Series 5|       Apple| Galaxy Note10|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple| Galaxy Note10|    Samsung|
| 26| Akash|Apple Watch Series 5|       Apple| Galaxy Note10|     Google|
| 26| Akash|Apple Watch Series 5|       Apple|Galaxy Note10+|      Apple|
| 26| Akash|Apple Watch Series 5|       Apple|Galaxy Note10+|    Samsung|
+---+------+--------------------+------------+--------------+-----------+
only showing top 20 rows
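
Note that the query above explodes phones.models and phones.name independently, which cross-joins every model with every brand name; that is why iphone X appears against Samsung and Google in the output. A sketch of a variant (same spark-shell session and schema as above) that explodes the phones array of structs once, so each model stays paired with its own brand:

streamDF
.withColumn("watches_models", explode($"watches.models"))
.withColumn("watches_name", $"watches.name")
.withColumn("phone", explode($"phones"))               // one row per phone struct
.withColumn("phones_models", explode($"phone.models"))
.withColumn("phones_name", $"phone.name")              // brand stays with its own models
.drop("watches", "phones", "phone")
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()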
