
Updating column in Spark DataFrame with JSON schema

I have JSON files, and I'm trying to hash one of their fields with SHA-256. The files are on AWS S3. I am currently using Spark with Python on Apache Zeppelin.

Here is my JSON schema; I am trying to hash the 'mac' field:

 |-- Document: struct (nullable = true)
 |    |-- data: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- mac: string (nullable = true)

I've tried a couple of things:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib

# read the JSON files from S3 and infer their schema
hcData = sqlc.read.option("inferSchema", "true").json(inputPath)
hcData.registerTempTable("hcData")

# hash the column whose name equals `name`; keep every other column as-is
name = 'Document'
udf = UserDefinedFunction(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), StringType())
new_df = hcData.select(*[udf(column).alias(name) if column == name else column for column in hcData.columns])

This code works fine. But when I try to hash the mac field by changing the name variable, nothing happens:

name = 'Document.data[0].mac'
name = 'mac'

I guess that's because it couldn't find a column with the given name.
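
That's consistent with how DataFrame columns work: hcData.columns only lists top-level column names, so a nested path never matches the comparison in the list comprehension above. A quick check (the printed output is an assumption based on the schema shown earlier):

# only top-level columns are listed; nested struct fields are not,
# so `column == name` is never true for 'mac' or 'Document.data[0].mac'
print(hcData.columns)  # ['Document']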

I've tried changing the code a bit:

from pyspark.sql.functions import udf

def valueToCategory(value):
    return hashlib.sha256(str(value).encode('utf-8')).hexdigest()

udfValueToCategory = udf(valueToCategory, StringType())
# the dotted name is taken literally, so this ADDS a new top-level column
# named "Document.data[0].mac" instead of updating the nested field
df = hcData.withColumn("Document.data[0].mac", udfValueToCategory("Document.data.mac"))

This code hashes "Document.data.mac" and creates a new column with the hashed MAC addresses, but I want to update the existing column. For columns that are not nested, updating works with no problem; for nested ones I couldn't find a way.
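
Inspecting df's schema shows what happened (a sketch; the exact output is an assumption based on the schema above): withColumn takes the dotted name literally and appends a new top-level column, leaving the nested field untouched. Such a column can only be referenced later by escaping its name in backticks.

df.printSchema()
# root
#  |-- Document: struct (nullable = true)
#  |    |-- data: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- mac: string (nullable = true)
#  |-- Document.data[0].mac: string (nullable = true)   <- new, literally-named column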

So basically, I want to hash a field in a nested JSON file with Spark and Python. Does anyone know how to update a Spark DataFrame with a nested schema?
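
For reference, on newer Spark releases (3.1+, where Column.withField and the Python transform function are available) the nested field can be rewritten in place with the built-in sha2 function; a minimal sketch, not tested on the setup above:

from pyspark.sql.functions import col, transform, sha2

# rebuild Document.data, replacing each element's mac with its SHA-256 hash;
# requires Spark 3.1+ for Column.withField and pyspark.sql.functions.transform
hashed = hcData.withColumn(
    "Document",
    col("Document").withField(
        "data",
        transform(
            col("Document.data"),
            lambda e: e.withField("mac", sha2(e["mac"], 256)),
        ),
    ),
)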

Well, I've found a solution to my question with Scala. There may be some redundant code, but it worked anyway.

import scala.util.matching.Regex
import java.security.MessageDigest

val inputPath = ""
val outputPath = ""

//finds mac addresses with the given regex; returns List("null") when nothing matches
def find(s: String, r: Regex): List[String] = {
    val l = r.findAllIn(s).toList
    if (l.nonEmpty) l else List("null")
}

//hashes the given string with sha256 and hex-encodes the digest
def hash(s: String): String =
    MessageDigest.getInstance("SHA-256").digest(s.getBytes).map(0xFF & _).map("%02x".format(_)).mkString

//replaces every mac address in the given line with its hash
//(findAllIn is used directly so a line without macs is left untouched)
def hashAll(s: String, r: Regex): String = {
    var st = s
    for (mac <- r.findAllIn(s).toList) {
        st = st.replaceAll(mac, hash(mac))
    }
    st
}

//read data
val rdd = sc.textFile(inputPath)

//mac address regular expression
val regex = "(([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2}))".r

//hash data
val hashed_rdd = rdd.map(line => hashAll(line, regex))

//write hashed data
hashed_rdd.saveAsTextFile(outputPath)

Below is the Python solution for my question. Both versions work on the raw text lines rather than the parsed DataFrame, so the JSON structure passes through untouched; only the substrings matching the MAC regex are rewritten.

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib  
import re


inputPath = ""
outputPath = ""


# finds mac addresses with the given regex; returns ["null"] when nothing matches
def find(s, r):
    l = re.findall(r, s)
    if len(l) != 0:
        return l
    else:
        return ["null"]


# hashes the given string with sha256
def hash(s):
    return hashlib.sha256(str(s).encode('utf-8')).hexdigest()


# replaces every mac address in the given line with its hash
def hashAll(s, r):
    st = s
    macs = re.findall(r, s)
    for mac in macs:
        st = st.replace(mac, hash(mac))
    return st


# read data
rdd = sc.textFile(inputPath)

# mac address regular expression; the groups are non-capturing so that
# re.findall returns whole matches rather than tuples of groups
regex = "(?:[0-9A-Z]{1,2}[:-]){5}[0-9A-Z]{1,2}"

# hash data
hashed_rdd = rdd.map(lambda line: hashAll(line, regex))

# write hashed data
hashed_rdd.saveAsTextFile(outputPath)
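
As a quick sanity check (a sketch; outputPath is assumed to point at the files just written), the hashed output can be read back as JSON to confirm that the structure survived the text-level rewrite:

# read the hashed output back and confirm the original schema is intact
checked = sqlc.read.json(outputPath)
checked.printSchema()
checked.select("Document.data").show(5, truncate=False)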
