Hash value from JSON column in Spark
I have a Cassandra table whose last column, "fullJson", holds a JSON log entry per row. I need to MD5-hash the userID value inside that JSON. Here is my approach, but for some reason I always get stuck at some point. Loading the Cassandra table:
scala> val rawCass = sc.cassandraTable[cassFormat]("keyspace", "logs").repartition(200)
rawCass: org.apache.spark.rdd.RDD[cassFormat] = MapPartitionsRDD[73] at coalesce at CassandraTableScanRDD.scala:256
This gives me:
scala> val cassDF2 = spark.createDataFrame(rawCass).select("fullJson")
cassDF2: org.apache.spark.sql.DataFrame = [fullJson: string]
scala> cassDF2.printSchema
root
|-- fullJson: string (nullable = true)
My JSON documents consist of a "header" and a "body", and I figure the best approach is to parse them into a DataFrame, select the userID column, and hash it with MD5:
scala> val nestedJson = spark.read.json(cassDF2.select("fullJson").rdd.map(_.getString(0))).select("header","body")
nestedJson: org.apache.spark.sql.DataFrame = [header: struct<KPI: string, action: string ... 16 more fields>, body: struct<1MYield: double, 1YYield: double ... 147 more fields>]
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- KPI: string (nullable = true)
| |-- action: string (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- context: string (nullable = true)
| |-- eventID: string (nullable = true)
| |-- interestArea: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- lat: string (nullable = true)
| | |-- lon: string (nullable = true)
| |-- navigationGroup: string (nullable = true)
| |-- sessionID: string (nullable = true)
| |-- timestamp: string (nullable = true)
| |-- userAge: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
| | |-- deviceResolution: string (nullable = true)
| | |-- deviceType: string (nullable = true)
| | |-- deviceVendor: string (nullable = true)
| | |-- os: string (nullable = true)
| | |-- osVersion: string (nullable = true)
| |-- userID: string (nullable = true)
| |-- userSegment: string (nullable = true)
|-- body: struct (nullable = true)
| |-- OS: string (nullable = true)
| |-- active: boolean (nullable = true)
| |-- amount: double (nullable = true)
| |-- amountCritical: string (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountBank: string (nullable = true)
...
Now hash header.userID:
val newDF = nestedJson.withColumn("header.userID", md5($"header.userID"))
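Note that `withColumn("header.userID", ...)` does not replace the nested field: it adds a new top-level column literally named `header.userID`, leaving the original value inside the struct untouched. A minimal sketch of replacing the field in place, assuming Spark 3.1+ where `Column.withField` is available:

```scala
import org.apache.spark.sql.functions.{col, md5}

// Rebuild the header struct with the userID field replaced by its MD5
// hash; withField (Spark 3.1+) avoids listing every other field by hand.
val hashedDF = nestedJson.withColumn(
  "header",
  col("header").withField("userID", md5(col("header.userID")))
)
```

On older Spark versions the equivalent is to reconstruct the struct with `struct(...)`, enumerating each `header` sub-field and swapping in `md5($"header.userID")` for `userID`.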
I want to save this to a CSV file, but that fails because the columns are structs.
newDF.write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").save("cass_full.csv")
I tried to avoid the struct types, but the deeper nesting makes that impossible (e.g. location is itself a struct of lat and lon):
scala> val tempT = newDF.select($"header.*",$"body.*")
tempT: org.apache.spark.sql.DataFrame = [KPI: string, action: string ... 165 more fields]
scala> tempT.printSchema
root
|-- KPI: string (nullable = true)
|-- action: string (nullable = true)
|-- appID: string (nullable = true)
|-- appVersion: string (nullable = true)
|-- context: string (nullable = true)
|-- eventID: string (nullable = true)
|-- interestArea: string (nullable = true)
|-- location: struct (nullable = true)
| |-- lat: string (nullable = true)
| |-- lon: string (nullable = true)
|-- navigationGroup: string (nullable = true)
...
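Since the CSV writer rejects struct columns, one workaround is to serialize whatever structs remain after the flattening step into JSON strings. A sketch against the flattened frame `tempT` from above:

```scala
import org.apache.spark.sql.functions.to_json
import org.apache.spark.sql.types.StructType

// Replace every remaining struct column (location, userAgent,
// beneficiary, ...) with its JSON-string form so CSV can hold it.
val csvReady = tempT.schema.fields
  .filter(_.dataType.isInstanceOf[StructType])
  .foldLeft(tempT) { (df, field) =>
    df.withColumn(field.name, to_json(df(field.name)))
  }
```

Array-of-struct columns would need the same `to_json` treatment; extend the filter accordingly if the writer still complains.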
The basic question: what is the simplest and most advisable approach? Should I rewrite the userID value directly in the JSON of each row, or can this be done somehow with DataFrames? The reason I ask is that I have another CSV file, from a different database, whose userID must be hashed with the same algorithm before the two datasets are joined.
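On the "same algorithm" concern: MD5 is deterministic, so hashing the userID on both sides with Spark's `md5()` yields matching join keys. Spark's `md5()` returns the lowercase hex digest of the UTF-8 bytes, equivalent to this plain-Scala sketch:

```scala
import java.security.MessageDigest

// Plain-Scala equivalent of Spark SQL's md5() column function:
// lowercase hex digest of the string's UTF-8 bytes.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
```

Because the digest depends only on the input bytes, hashing the Cassandra-side userID and the CSV-side userID with either this function or Spark's built-in produces identical keys for the join.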
Try saving it as parquet instead, then continue with the second part of your logic, the join.
Hope this helps!
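A sketch of that suggested flow, assuming a frame `hashedDF` holding the already-hashed Cassandra data (the name and file paths here are placeholders): Parquet preserves nested structs natively, so no flattening is needed before the join.

```scala
import org.apache.spark.sql.functions.{col, md5}

// Persist the hashed Cassandra data as Parquet (structs survive intact).
hashedDF.write.mode("overwrite").parquet("cass_full.parquet")

// Second part: hash the other source with the same md5() and join.
val cassData = spark.read.parquet("cass_full.parquet")
val otherSide = spark.read.option("header", "true").csv("other_source.csv")
  .withColumn("userID", md5(col("userID")))   // same algorithm on both sides

val joined = cassData.join(
  otherSide,
  cassData("header.userID") === otherSide("userID")
)
```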