Pyspark 在 function 的基礎上加入 dataframe

Question

我有 2 個看起來像這樣的數據框

網絡

+----------------+-------+
|    Network     | VLAN  |
+----------------+-------+
| 192.168.1.0/24 | VLAN1 |
| 192.168.2.0/24 | VLAN2 |
+----------------+-------+

流動

+--------------+----------------+
|  source_ip   | destination_ip |
+--------------+----------------+
| 192.168.1.11 | 192.168.2.13   |
+--------------+----------------+

理想情況下，我想得到這樣的東西

+--------------+----------------+-------------+------------------+
|  source_ip   | destination_ip | source_vlan | destination_vlan |
+--------------+----------------+-------------+------------------+
| 192.168.1.11 | 192.168.2.13   | VLAN1       | VLAN2            |
+--------------+----------------+-------------+------------------+

不幸的是，流 dataframe 不包含子網掩碼。 到目前為止我在沒有 pyspark 的情況下嘗試了什么

獲取一個不同的子網列表（在本例中 [24]）
對於每個子網和 source_ip 計算網絡ipaddress.ip_network('{}/{}'.format(ip,sub), strict=False)
搜索 dataframe "networks" 中是否存在該子網並返回 VLAN，否則返回空字符串

我試圖用 pyspark 做類似的方法，但它不起作用，因為我認為可能有更好的方法嗎？

def get_available_subnets(df):
    split_col = split(df['network'], '/')
    df = df.withColumn('sub', split_col.getItem(1))
    return df.select('sub').distinct()

def get_vlan_by_ip(ip, infoblox, subnets):
    for sub in subnets:
        net = ipaddress.ip_network('{}/{}'.format(ip,sub), strict=False)
        if net:
            search = infoblox.filter(infoblox.network == str(net))

            if not search.head(1).isEmpty():
                return search.first.vlan
    return hashlib.sha1(str.encode(ip)).hexdigest()

subnets = get_available_subnets(infoblox_networks_df).select('sub').rdd.flatMap(lambda x: x).collect()


short = flows_filtered_prepared_df.limit(1000)


partial_vlan_func = partial(get_vlan_by_ip, infoblox=infoblox_networks_df, subnets=subnets)
get_vlan_udf = udf(lambda ip: partial_vlan_func(ip), StringType())

short.select('source_ip', 'destination_ip', get_vlan_udf('source_ip').alias('source_vlan')).show()

Answer 1

這種方法完全避免使用udf ，利用split和slice ，但也許有更好的方法。 這種方法的好處是它直接利用了子網掩碼中存在的位，並且它純粹是在PySpark中編寫的。

解決方案的上下文： IP 地址可以被子網拆分和屏蔽。 這意味着8, 16, 24, 32告訴您 IP 的哪些部分很重要 - 這會促使除以 8 並使用結果列將 IP 地址ArrayType列從其原始StringType拆分出來。

注意： pyspark.sql.functions.slice將在較新版本的PySpark >= 2.4中工作，一些舊版本需要使用f.expr("slice(...)")

設置：

flows = spark.createDataFrame([
    (1, "192.168.1.1", "192.168.2.1"),
    (2, "192.168.2.1", "192.168.3.1"), 
    (3, "192.168.3.1", "192.168.1.1"), 
], ['id', 'source_ip', 'destination_ip'] 
)
networks = spark.createDataFrame([
    (1, "192.168.1.0/24", "VLAN1"),
    (2, "192.168.2.0/24", "VLAN2"), 
    (3, "192.168.3.0/24", "VLAN3"), 
], ['id', 'network', 'vlan'] 
)

一些預處理：

networks_split = networks.select(
    "*",
    (f.split(f.col("network"), "/")[1] / 8).cast("int").alias("bits"),
    f.split(f.split(f.col("network"), "/")[0], "\.").alias('segmented_ip')
)
networks_split.show()
+---+--------------+-----+----+----------------+
| id|       network| vlan|bits|    segmented_ip|
+---+--------------+-----+----+----------------+
|  1|192.168.1.0/24|VLAN1|   3|[192, 168, 1, 0]|
|  2|192.168.2.0/24|VLAN2|   3|[192, 168, 2, 0]|
|  3|192.168.3.0/24|VLAN3|   3|[192, 168, 3, 0]|
+---+--------------+-----+----+----------------+

networks_masked = networks_split.select(
    "*",
    f.expr("slice(segmented_ip, 1, bits)").alias("masked_bits"),
)
networks_masked.show()
+---+--------------+-----+----+----------------+-------------+
| id|       network| vlan|bits|    segmented_ip|  masked_bits|
+---+--------------+-----+----+----------------+-------------+
|  1|192.168.1.0/24|VLAN1|   3|[192, 168, 1, 0]|[192, 168, 1]|
|  2|192.168.2.0/24|VLAN2|   3|[192, 168, 2, 0]|[192, 168, 2]|
|  3|192.168.3.0/24|VLAN3|   3|[192, 168, 3, 0]|[192, 168, 3]|
+---+--------------+-----+----+----------------+-------------+

flows_split = flows.select(
    "*",
    f.split(f.split(f.col("source_ip"), "/")[0], "\.").alias('segmented_source_ip'),
    f.split(f.split(f.col("destination_ip"), "/")[0], "\.").alias('segmented_destination_ip')
)
flows_split.show()
+---+-----------+--------------+-------------------+------------------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip|
+---+-----------+--------------+-------------------+------------------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|
+---+-----------+--------------+-------------------+------------------------+

最后，我根據掩碼的bits對切片進行crossJoin和過濾，例如：

flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_source_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits|  masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|VLAN1|   3|[192, 168, 1]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|VLAN2|   3|[192, 168, 2]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|VLAN3|   3|[192, 168, 3]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+

可以對destination_ip執行完全相同的方法，例如：

flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_destination_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits|  masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|VLAN2|   3|[192, 168, 2]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|VLAN3|   3|[192, 168, 3]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|VLAN1|   3|[192, 168, 1]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+

最后，您可以在source_ip和destination_ip上將生成的兩個表連接在一起（因為您已根據需要附加了vlan信息），或者您將前兩個步驟合並在一起並crossJoin和filter兩次。

希望這可以幫助！

Pyspark 在 function 的基礎上加入 dataframe

問題描述

1 個解決方案

解決方案1
2 已采納 2020-05-29 14:09:05

Pyspark 在 function 的基礎上加入 dataframe

問題描述

1 個解決方案

解決方案1 2 已采納 2020-05-29 14:09:05

解決方案1
2 已采納 2020-05-29 14:09:05