
Decode binary data in Spark with native column functions

I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:

import spark.implicits._  // assumes an active SparkSession available as `spark`

val df = Seq(
  Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
).toDF("binary_value")

The 4 bytes in this example can be interpreted as a U32 to form the number 742. Using a UDF, the value can be decoded like this:

import org.apache.spark.sql.functions.udf

// BigInt(x) reads the byte array as a big-endian two's-complement value
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)

df.withColumn("numerical_value", bytesToInt('binary_value))

The UDF approach works, but at the cost of using a UDF and the corresponding serialization / deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.

Is there a way to interpret the binary column as another data type using Spark native functions?

One way could be converting to hex (using hex) and then to decimal (using conv).

import org.apache.spark.sql.functions.{conv, hex}

df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
// +-------------+---------------+
// | binary_value|numerical_value|
// +-------------+---------------+
// |[00 00 02 E6]|            742|
// +-------------+---------------+
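Note that conv returns a string column, so a cast can be appended when a numeric type is needed downstream. A minimal sketch, assuming the same df as above; cast("long") rather than cast("int") keeps the full U32 range, since values above 0x7FFFFFFF do not fit in a 32-bit Int:

df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10).cast("long"))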
