Hash value from JSON column in Spark
I have a Cassandra table whose last column, named "fullJson", contains JSON log lines. I need to hash the userID values in those JSON lines using MD5. Here's my approach, but for some reason I always get stuck at some point. Loaded the Cassandra table:
scala> val rawCass = sc.cassandraTable[cassFormat]("keyspace", "logs").repartition(200)
rawCass: org.apache.spark.rdd.RDD[cassFormat] = MapPartitionsRDD[73] at coalesce at CassandraTableScanRDD.scala:256
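For reference, cassFormat is a case class matching the table's columns, along these lines (the column names here are illustrative):
// Illustrative sketch only; the real class mirrors the actual Cassandra columns,
// ending with the JSON payload column.
case class cassFormat(id: String, timestamp: String, fullJson: String)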
And I get:
scala> val cassDF2 = spark.createDataFrame(rawCass).select("fullJson")
cassDF2: org.apache.spark.sql.DataFrame = [fullJson: string]
scala> cassDF2.printSchema
root
|-- fullJson: string (nullable = true)
My JSON file consists of "header" and "body", and I guess the best approach is to get a DataFrame, then select the userID column and hash it with MD5.
scala> val nestedJson = spark.read.json(cassDF2.select("fullJson").rdd.map(_.getString(0))).select("header","body")
nestedJson: org.apache.spark.sql.DataFrame = [header: struct<KPI: string, action: string ... 16 more fields>, body: struct<1MYield: double, 1YYield: double ... 147 more fields>]
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- KPI: string (nullable = true)
| |-- action: string (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- context: string (nullable = true)
| |-- eventID: string (nullable = true)
| |-- interestArea: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- lat: string (nullable = true)
| | |-- lon: string (nullable = true)
| |-- navigationGroup: string (nullable = true)
| |-- sessionID: string (nullable = true)
| |-- timestamp: string (nullable = true)
| |-- userAge: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
| | |-- deviceResolution: string (nullable = true)
| | |-- deviceType: string (nullable = true)
| | |-- deviceVendor: string (nullable = true)
| | |-- os: string (nullable = true)
| | |-- osVersion: string (nullable = true)
| |-- userID: string (nullable = true)
| |-- userSegment: string (nullable = true)
|-- body: struct (nullable = true)
| |-- OS: string (nullable = true)
| |-- active: boolean (nullable = true)
| |-- amount: double (nullable = true)
| |-- amountCritical: string (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountBank: string (nullable = true)
...
Now to hash header.userID:
val newDF = nestedJson.withColumn("header.userID", md5($"header.userID"))
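Note: withColumn with a dotted name actually adds a new top-level column literally named "header.userID" rather than replacing the nested field. On Spark 3.1+ the nested field can be replaced in place with Column.withField; a sketch, assuming that version:
import org.apache.spark.sql.functions.md5

// Replace the nested field in place (Column.withField requires Spark 3.1+);
// on older versions the header struct has to be rebuilt field by field.
val newDF2 = nestedJson.withColumn("header",
  $"header".withField("userID", md5($"header.userID")))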
I want to save that to a CSV file, but it cannot be done because it's a struct.
newDF.write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").save("cass_full.csv")
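The CSV source only writes atomic column types, so the structs have to be flattened or serialized first. One workaround sketch is to serialize them back to JSON strings with to_json (available since Spark 2.1):
import org.apache.spark.sql.functions.to_json

// Serialize the struct columns back to JSON strings so the CSV writer accepts them.
val csvSafe = newDF.select(to_json($"header").as("header"), to_json($"body").as("body"))
csvSafe.write.option("header", "true").option("delimiter", "|").csv("cass_full_csv")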
I tried to avoid the struct type but couldn't because of further nesting (e.g. location consists of lat and lon):
scala> val tempT = newDF.select($"header.*",$"body.*")
tempT: org.apache.spark.sql.DataFrame = [KPI: string, action: string ... 165 more fields]
scala> tempT.printSchema
root
|-- KPI: string (nullable = true)
|-- action: string (nullable = true)
|-- appID: string (nullable = true)
|-- appVersion: string (nullable = true)
|-- context: string (nullable = true)
|-- eventID: string (nullable = true)
|-- interestArea: string (nullable = true)
|-- location: struct (nullable = true)
| |-- lat: string (nullable = true)
| |-- lon: string (nullable = true)
|-- navigationGroup: string (nullable = true)
...
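The leftover structs can be pulled up to plain columns the same way; a sketch for location (array columns such as beneficiary.beneficiaryAccounts would still need to be dropped or serialized with to_json before a CSV write):
// Flatten the remaining struct into plain columns, then drop the struct itself.
val flat = tempT
  .withColumn("location_lat", $"location.lat")
  .withColumn("location_lon", $"location.lon")
  .drop("location")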
Basic question: what's the easiest and most preferable way to do this? Should I just change the userID value in every JSON line, or can it be done differently with DataFrames? The reason for doing this is that I have another CSV file, from another database, that also needs to be hashed with the same algorithm and joined afterwards.
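For that second file, applying the same md5 call to its key column should produce matching values for the join; a hypothetical sketch (the path and column name are assumptions):
// Hash the join key of the other CSV with the same algorithm so the keys line up.
val otherDF = spark.read.option("header", "true").csv("/path/to/other_db.csv")
  .withColumn("userID", md5($"userID"))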
Please try saving this in parquet and then move on with the second part of your logic, the join.
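A minimal sketch of that suggestion (the output path is an assumption):
// Parquet preserves the nested struct schema, so the hashed frame can be
// written as-is and read back later for the join.
newDF.write.mode("overwrite").parquet("cass_full.parquet")
val reloaded = spark.read.parquet("cass_full.parquet")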
Hope this helps!