
Hash value from Json column in Spark

I have a Cassandra table, and the last column, named "fullJson", contains JSON log lines. I need to hash the userID values in those JSON lines using MD5. Here's my approach, but for some reason I always get stuck at some point. Loaded Cassandra table:

scala> val rawCass = sc.cassandraTable[cassFormat]("keyspace", "logs").repartition(200)
rawCass: org.apache.spark.rdd.RDD[cassFormat] = MapPartitionsRDD[73] at coalesce at CassandraTableScanRDD.scala:256
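
For reference, cassFormat (not shown above) is a case class that maps the table's columns for the connector. A minimal sketch, where every column name besides fullJson is an assumption:

case class cassFormat(
  id: String,        // assumed key column; replace with the real table schema
  fullJson: String)  // the JSON log line used below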

And I get:

scala> val cassDF2 = spark.createDataFrame(rawCass).select("fullJson")
cassDF2: org.apache.spark.sql.DataFrame = [fullJson: string]

scala> cassDF2.printSchema
root
 |-- fullJson: string (nullable = true)

My JSON file consists of a "header" and a "body", and I guess the best approach is to get a DataFrame, then select the userID column and hash it with MD5.

scala> val nestedJson = spark.read.json(cassDF2.select("fullJson").rdd.map(_.getString(0))).select("header","body")
nestedJson: org.apache.spark.sql.DataFrame = [header: struct<KPI: string, action: string ... 16 more fields>, body: struct<1MYield: double, 1YYield: double ... 147 more fields>]
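
As an aside, on Spark 2.2+ the Dataset[String] overload of spark.read.json avoids going through an RDD[String]. A sketch of the equivalent read:

import spark.implicits._

// Same parse, but feeding the reader a Dataset[String] (Spark 2.2+)
val nestedJson2 = spark.read
  .json(cassDF2.select("fullJson").as[String])
  .select("header", "body")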

scala> nestedJson.printSchema
root
 |-- header: struct (nullable = true)
 |    |-- KPI: string (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- appID: string (nullable = true)
 |    |-- appVersion: string (nullable = true)
 |    |-- context: string (nullable = true)
 |    |-- eventID: string (nullable = true)
 |    |-- interestArea: string (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- lat: string (nullable = true)
 |    |    |-- lon: string (nullable = true)
 |    |-- navigationGroup: string (nullable = true)
 |    |-- sessionID: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |    |-- userAge: string (nullable = true)
 |    |-- userAgent: struct (nullable = true)
 |    |    |-- browser: string (nullable = true)
 |    |    |-- browserVersion: string (nullable = true)
 |    |    |-- deviceName: string (nullable = true)
 |    |    |-- deviceResolution: string (nullable = true)
 |    |    |-- deviceType: string (nullable = true)
 |    |    |-- deviceVendor: string (nullable = true)
 |    |    |-- os: string (nullable = true)
 |    |    |-- osVersion: string (nullable = true)
 |    |-- userID: string (nullable = true)
 |    |-- userSegment: string (nullable = true)
 |-- body: struct (nullable = true)
 |    |-- OS: string (nullable = true)
 |    |-- active: boolean (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- amountCritical: string (nullable = true)
 |    |-- beneficiary: struct (nullable = true)
 |    |    |-- beneficiaryAccounts: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- beneficiaryAccountBank: string (nullable = true)
...

Now to hash header.userID:

val newDF = nestedJson.withColumn("header.userID", md5($"header.userID"))
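
Note that withColumn with a dotted name like this does not replace the nested field; it adds a new top-level column whose name literally contains a dot. One way to hash the field in place is to rebuild the header struct. A sketch that lists only a few of the fields (the remaining header fields would be enumerated the same way):

import org.apache.spark.sql.functions.{md5, struct}

// Rebuild `header`, re-listing its fields and replacing userID with its hash.
val hashedDF = nestedJson.withColumn("header", struct(
  $"header.KPI",
  $"header.action",
  // ...all remaining header fields...
  md5($"header.userID").as("userID"),
  $"header.userSegment"))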

I want to save that to a CSV file, but it cannot be done as-is because of the struct columns:

newDF.write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").save("cass_full.csv")
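
If CSV output is a requirement, one workaround is to serialize the struct columns back to JSON strings with to_json (available since Spark 2.1) before writing. A sketch, where the output path is an assumption:

import org.apache.spark.sql.functions.to_json

// CSV cannot hold struct columns; turn each struct back into a JSON string.
val csvReady = newDF
  .withColumn("header", to_json($"header"))
  .withColumn("body", to_json($"body"))

csvReady.write
  .option("header", "true")
  .option("delimiter", "|")
  .csv("cass_full_csv")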

I tried to avoid the struct type, but couldn't because of the other nesting (e.g. location consists of lat and lon):

scala> val tempT = newDF.select($"header.*",$"body.*")
tempT: org.apache.spark.sql.DataFrame = [KPI: string, action: string ... 165 more fields]

scala> tempT.printSchema
root
 |-- KPI: string (nullable = true)
 |-- action: string (nullable = true)
 |-- appID: string (nullable = true)
 |-- appVersion: string (nullable = true)
 |-- context: string (nullable = true)
 |-- eventID: string (nullable = true)
 |-- interestArea: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- lat: string (nullable = true)
 |    |-- lon: string (nullable = true)
 |-- navigationGroup: string (nullable = true)
...
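
The second-level structs can be pulled up the same way with explicit aliases. A sketch for location (array-typed fields such as beneficiary.beneficiaryAccounts would still need to_json or explode):

// Promote the nested location fields and drop the struct so no
// struct columns remain for the CSV writer.
val fullyFlat = tempT
  .withColumn("location_lat", $"location.lat")
  .withColumn("location_lon", $"location.lon")
  .drop("location")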

Basic question: what's the easiest and most preferable way to do this? Should I just change the userID value for every line of the JSON, or can it be done differently with DataFrames? The reason for doing this is that I have another CSV file, from another database, that needs to be hashed with the same algorithm and joined afterwards.

Please try saving this in Parquet and then move on to the second part of your logic, the join.
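
For instance, something along these lines, where the paths and the other file's column name are assumptions:

// Parquet keeps the nested schema intact, so nothing needs flattening.
newDF.write.mode("overwrite").parquet("cass_full_parquet")

// Hash the other source's userID with the same MD5, then join on it.
import org.apache.spark.sql.functions.md5

val other = spark.read.option("header", "true").csv("other_db.csv")
  .withColumn("hashedUserID", md5($"userID"))

val joined = spark.read.parquet("cass_full_parquet")
  .join(other, $"header.userID" === $"hashedUserID")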

Hope this helps!
