
How to store numpy.ndarray in the columns of DataFrame

In Structured Streaming, how can I use a UDF that returns a numpy.ndarray with two elements to create two new columns?

This is what I have so far:

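from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

schema = StructType([
    StructField("host_id", LongType()),
    StructField("fence_id", LongType()),
    StructField("policy_id", LongType()),
    StructField("timestamp", LongType()),
    StructField("distances", ArrayType(LongType()))
])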

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

ds.printSchema()
pa = PosAlgorithm()
get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))
dfnew = ds.withColumn("location", get_distance_udf(col("distances")))

query = dfnew \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()

The function pa.getLocation returns a numpy.ndarray, for example [42.15999863, 2.08498164]. I want to store these numbers in two new columns of the DataFrame dfnew, called latitude and longitude.

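Replace

get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))

with

from pyspark.sql.types import DoubleType, StructField, StructType

# Return a struct of two doubles; tolist() converts the ndarray to plain Python floats.
get_distance_udf = udf(
    lambda y: pa.getLocation(y).tolist(),
    StructType([
        StructField("latitude", DoubleType()),
        StructField("longitude", DoubleType())
    ])
)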
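The .tolist() call is there because a Python UDF should hand Spark built-in Python values rather than NumPy scalars, which many PySpark versions cannot serialize. Below is a minimal sketch of a slightly more defensive conversion, assuming the result always has exactly two elements. The name to_lat_lon is a hypothetical helper for illustration, not part of PySpark or the original code.

```python
from typing import Sequence, Tuple


def to_lat_lon(values: Sequence[float]) -> Tuple[float, float]:
    """Convert a two-element sequence (e.g. a numpy.ndarray) to plain floats.

    Hypothetical helper: assumes the input holds exactly latitude, longitude.
    """
    if len(values) != 2:
        raise ValueError(f"expected 2 values, got {len(values)}")
    # float() turns NumPy scalars such as numpy.float64 into built-in floats.
    return float(values[0]), float(values[1])


# Works on any indexable two-element sequence, shown here with a plain list.
print(to_lat_lon([42.15999863, 2.08498164]))
```

Inside the UDF this could replace the .tolist() call, for example lambda y: to_lat_lon(pa.getLocation(y)), with the same StructType as the return type.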

Then expand the result as needed:

from pyspark.sql.functions import col

# Assign the result so the streaming query can write dfnew.
dfnew = (ds
    .withColumn("location", get_distance_udf(col("distances")))
    .withColumn("latitude", col("location.latitude"))
    .withColumn("longitude", col("location.longitude")))

