
Efficient matching between two pyspark Dataframe columns

I have a pyspark Dataframe with the following schema:

root
 |-- id: integer (nullable = true)
 |-- url: string (nullable = true)
 |-- cosine_vec: vector (nullable = true)
 |-- similar_url: array (nullable = true)
 |    |-- element: integer (containsNull = true)

similar_url is a column that contains arrays of integers. These integers refer to the id column.

For example:

+----+--------------------+--------------------+--------------------+
|  id|                 url|                 vec|         similar_url|
+----+--------------------+--------------------+--------------------+
|  26|https://url_26......|[0.81382234943025...|[1724, 911, 1262,...|
+----+--------------------+--------------------+--------------------+

I want to replace the value 1724 in similar_url with the url from the row whose id is 1724.

That's just one example. My problem is that I would like to perform this for every row, efficiently.

The output would look like this:

+----+--------------------+--------------------+--------------------+
|  id|                 url|                 vec|         similar_url|
+----+--------------------+--------------------+--------------------+
|  26|https://url_26......|[0.81382234943025...|[https://url_1724...|
+----+--------------------+--------------------+--------------------+

Do you have any thoughts?

I created a small sample dataframe based on your explanation:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

# assumes a SparkSession is available (created here for a self-contained example)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, "url_1", [0.3, 0.6], [2, 3]),
        (2, "url_2", [0.3, 0.5], [1, 3]),
        (3, "url_3", [0.6, 0.5], [1, 2]),
    ],
    ["id", "url", "vec", "similar_url"],
)

df.show()
+---+-----+----------+-----------+
| id|  url|       vec|similar_url|
+---+-----+----------+-----------+
|  1|url_1|[0.3, 0.6]|     [2, 3]|
|  2|url_2|[0.3, 0.5]|     [1, 3]|
|  3|url_3|[0.6, 0.5]|     [1, 2]|
+---+-----+----------+-----------+

If you are using Spark 2.4 or later, there is a built-in function called "arrays_zip" that you can use instead of my UDF (see the sketch after the UDF below):

# schema of the zipped column: one struct per (vec value, similar_url id) pair
outType = T.ArrayType(
    T.StructType([
        T.StructField("vec", T.FloatType(), True),
        T.StructField("similar_url", T.IntegerType(), True),
    ]))

@F.udf(outType)
def arrays_zip(vec, similar_url):
    # zip returns an iterator in Python 3, so materialize it as a list
    return list(zip(vec, similar_url))
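
For reference, here is a minimal sketch of the built-in variant (an assumption based on Spark >= 2.4 behaviour, where the struct fields produced by arrays_zip take the names of the input columns, so "zip.vec" and "zip.similar_url" still resolve in the code below):

# Spark >= 2.4 only: the built-in arrays_zip replaces the UDF above
df_zipped = df.withColumn(
    "zips",
    F.arrays_zip(F.col("vec"), F.col("similar_url"))
)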

Then you can process your data:

df.withColumn(
    "zips",
    arrays_zip(F.col("vec"), F.col("similar_url"))   # pair each vec value with its similar_url id
).withColumn(
    "zip",
    F.explode("zips")                                # one row per (vec, similar_url) pair
).alias("df").join(
    df.alias("df_2"),
    F.col("df_2.id") == F.col("df.zip.similar_url")  # look up the url of each similar id
).groupBy("df.id", "df.url").agg(
    F.collect_list("df.zip.vec").alias("vec"),
    F.collect_list("df_2.url").alias("similar_url"),
).show()

+---+-----+----------+--------------+                                           
| id|  url|       vec|   similar_url|
+---+-----+----------+--------------+
|  3|url_3|[0.6, 0.5]|[url_1, url_2]|
|  2|url_2|[0.3, 0.5]|[url_1, url_3]|
|  1|url_1|[0.6, 0.3]|[url_3, url_2]|
+---+-----+----------+--------------+

collect_list after a join does not guarantee the order of the collected elements, so if you want to keep the original order, you need to do a bit more manipulation:

@F.udf(T.ArrayType(T.FloatType()))
def get_vec(new_list):
    # sort the structs by their original position ("pos"), then keep the vec values
    new_list.sort(key=lambda x: x[0])
    return [x[1] for x in new_list]

@F.udf(T.ArrayType(T.StringType()))
def get_similar_url(new_list):
    # sort the structs by their original position ("pos"), then keep the urls
    new_list.sort(key=lambda x: x[0])
    return [x[2] for x in new_list]

df.withColumn(
    "zips",
    arrays_zip(F.col("vec"), F.col("similar_url"))
).select(
    "id",
    "url",
    F.posexplode("zips")          # keeps the original position in a "pos" column
).alias("df").join(
    df.alias("df_2"),
    F.col("df_2.id") == F.col("df.col.similar_url")
).select(
    "df.id",
    "df.url",
    F.struct(                     # carry pos along so the order can be restored
        F.col("df.pos").alias("pos"),
        F.col("df.col.vec").alias("vec"),
        F.col("df_2.url").alias("similar_url"),
    ).alias("new_struct")
).groupBy(
    "id",
    "url"
).agg(
    F.collect_list("new_struct").alias("new_list")
).select(
    "id",
    "url",
    get_vec(F.col("new_list")).alias("vec"),                  # re-sorted by pos
    get_similar_url(F.col("new_list")).alias("similar_url"),  # re-sorted by pos
).show()
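
As a side note in the same Spark >= 2.4 spirit, the two sorting UDFs could themselves be replaced with built-ins. A minimal, untested sketch, where grouped is a hypothetical name for the DataFrame obtained right after the .agg(F.collect_list(...)) step above:

# Assumption: `grouped` stands for the DataFrame right after the .agg(...) step.
# sort_array orders the structs by their first field ("pos"), and the SQL
# higher-order function transform extracts one field from each struct.
result = grouped.withColumn(
    "new_list", F.sort_array("new_list")
).select(
    "id",
    "url",
    F.expr("transform(new_list, x -> x.vec)").alias("vec"),
    F.expr("transform(new_list, x -> x.similar_url)").alias("similar_url"),
)
result.show()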
