
Spark streaming: unstructured records

We are consuming data from EventHub using streaming. The incoming stream contains JSON records of various types (around 400 different types).

Each record is categorized using its productId property.

Example (incoming stream of records):

record1 - { productId: 101, colA: "some-val", colB: "some-val" }
record2 - { productId: 104, colC: "some-val", colD: "some-val" }
record3 - { productId: 202, colA: "some-val", colD: "some-val", colF: "some-val" }
record4 - { productId: 342, colH: "some-val", colJ: "some-val", colK: "some-val" }

The number of properties varies from record to record, but records with the same productId have exactly the same set of properties.

productId ranges from 1 to 400, and the number of properties in a record can be up to 50.

I want to read the above stream of JSON records and write them to different parquet/delta locations, like:

    Location(Delta/Parquet)             Records
    -----------------------------------------------------------------
    /mnt/product-101        Contains all records with productId - 101
    /mnt/product-104        Contains all records with productId - 104
    /mnt/product-202        Contains all records with productId - 202
    /mnt/product-342        Contains all records with productId - 342

1) How do I create a DataFrame/Dataset from a stream containing different types of records?

2) Is it possible to do this with a single Spark stream, writing to different delta/parquet locations?

Please note that this method should work, but it will generate a lot of sparse data.

First, create a StructType with all the columns combined.

  import org.apache.spark.sql.types.{LongType, StringType, StructType}

  val schema = new StructType().add("productId", LongType)
    .add("colA", StringType).add("colB", StringType)
    .add("colC", StringType)....

Then create a stream using this schema and the from_json function.

import org.apache.spark.sql.functions.from_json
import spark.implicits._ // needed for the 'value Symbol-to-Column syntax

val df1 = df.select(from_json('value, schema).alias("tmp"))
  .select("tmp.*")

Finally, use partitionBy to write partitioned parquet files.

val query1 = df1
  .writeStream
  .format("parquet")
  .option("path", "/mnt/product/")
  .option("checkpointLocation", "/tmp/checkpoint")
  .partitionBy("productId")
  .start()

This will generate rows with all the columns. The columns that weren't in the original JSON input will be null. Parquet supports writing nulls, but it would be better to filter them out first.
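
Also note that partitionBy writes subdirectories such as /mnt/product/productId=101/ under the single base path, rather than the exact /mnt/product-101 style locations from the question. If you really need separate locations and want to drop the sparse null columns before writing, one option (a rough sketch, not from the original answer, and not optimized) is foreachBatch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val query2 = df1.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Distinct product ids present in this micro-batch (ignore unparsable rows)
    val ids = batch.filter(col("productId").isNotNull)
      .select("productId").distinct().collect().map(_.getLong(0))

    ids.foreach { id =>
      val subset = batch.filter(col("productId") === id)
      // Keep only the columns that have at least one non-null value for this
      // product in this batch; simple, but runs one small job per column.
      val nonNullCols = subset.columns.filter { c =>
        subset.filter(col(c).isNotNull).limit(1).count() > 0
      }
      subset.select(nonNullCols.map(col): _*)
        .write.mode("append")
        .parquet(s"/mnt/product-$id") // per-product path, as in the question
    }
  }
  .option("checkpointLocation", "/tmp/checkpoint-foreachbatch") // illustrative path
  .start()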
