Decode binary data in Spark with native column functions
I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:
val df = Seq(
(Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte))
).toDF("binary_value")
The 4 bytes in this example can be interpreted as a U32 to form the number 742. Using a UDF, the value can be decoded like this:
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)
df.withColumn("numerical_value", bytesToInt('binary_value))
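As a side note, the interpretation the UDF performs (big-endian, signed 32-bit) can be checked in plain Scala outside Spark; this sketch is only an illustration, not part of the original question:

```scala
import java.nio.ByteBuffer

// Decode a big-endian 4-byte array as a signed 32-bit Int.
// For 4-byte input this is equivalent to BigInt(x).toInt.
val bytes = Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
val decoded = ByteBuffer.wrap(bytes).getInt
// decoded == 742
```

Note that both BigInt(x).toInt and ByteBuffer.getInt treat the value as signed, so inputs with the high bit set (e.g. 0xFF 0xFF 0xFF 0xFF) come out negative rather than as a true U32.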
It works, but at the cost of using a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.
Is there a way to interpret the binary column as another data type using Spark-native functions?
One way could be converting to hex (using hex) and then to decimal (using conv):
conv(hex($"binary_value"), 16, 10)
df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
// +-------------+---------------+
// | binary_value|numerical_value|
// +-------------+---------------+
// |[00 00 02 E6]| 742|
// +-------------+---------------+
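One caveat worth adding on top of this answer: conv produces a string column, so if a numeric column is needed downstream, the result can be cast explicitly (the cast("int") here is an addition, not part of the original answer):

```scala
// conv(..., 16, 10) yields a StringType column ("742"),
// so cast it to get an actual integer column.
df.withColumn(
  "numerical_value",
  conv(hex($"binary_value"), 16, 10).cast("int")
)
```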