Pyspark DataFrame Filter column based on a column in another DataFrame without join
I have two DataFrames that look like this:
networks
+----------------+-------+
| Network | VLAN |
+----------------+-------+
| 192.168.1.0/24 | VLAN1 |
| 192.168.2.0/24 | VLAN2 |
+----------------+-------+
flows
+--------------+----------------+
| source_ip | destination_ip |
+--------------+----------------+
| 192.168.1.11 | 192.168.2.13 |
+--------------+----------------+
Ideally, I would like to end up with something like this:
+--------------+----------------+-------------+------------------+
| source_ip | destination_ip | source_vlan | destination_vlan |
+--------------+----------------+-------------+------------------+
| 192.168.1.11 | 192.168.2.13 | VLAN1 | VLAN2 |
+--------------+----------------+-------------+------------------+
Unfortunately, the flows DataFrame does not contain the subnet mask. So far, without PySpark, I have tried:
ipaddress.ip_network('{}/{}'.format(ip,sub), strict=False)
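For comparison, outside Spark the stdlib ipaddress module handles the containment test directly. A minimal sketch of that lookup, where the NETWORKS mapping is a hypothetical stand-in for the networks DataFrame:

```python
import ipaddress

# Hypothetical lookup table mirroring the networks DataFrame.
NETWORKS = {
    "192.168.1.0/24": "VLAN1",
    "192.168.2.0/24": "VLAN2",
}

def vlan_for_ip(ip, networks=NETWORKS):
    """Return the VLAN whose network contains ip, or None if no match."""
    addr = ipaddress.ip_address(ip)
    for cidr, vlan in networks.items():
        if addr in ipaddress.ip_network(cidr):
            return vlan
    return None
```

This works fine for a handful of IPs, but as a driver-side loop it does not scale to a large flows table.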
I tried a similar approach with PySpark, but it does not work, so I suspect there may be a better way:
import hashlib
import ipaddress
from functools import partial

from pyspark.sql.functions import split, udf
from pyspark.sql.types import StringType

def get_available_subnets(df):
    split_col = split(df['network'], '/')
    df = df.withColumn('sub', split_col.getItem(1))
    return df.select('sub').distinct()

def get_vlan_by_ip(ip, infoblox, subnets):
    for sub in subnets:
        net = ipaddress.ip_network('{}/{}'.format(ip, sub), strict=False)
        if net:
            search = infoblox.filter(infoblox.network == str(net))
            if search.head(1):  # head(1) returns a list; empty list means no match
                return search.first().vlan
    return hashlib.sha1(str.encode(ip)).hexdigest()

subnets = get_available_subnets(infoblox_networks_df).select('sub').rdd.flatMap(lambda x: x).collect()
short = flows_filtered_prepared_df.limit(1000)
partial_vlan_func = partial(get_vlan_by_ip, infoblox=infoblox_networks_df, subnets=subnets)
# NOTE: capturing a DataFrame (infoblox) inside a UDF cannot work --
# DataFrames are driver-side objects and are not usable on executors.
get_vlan_udf = udf(lambda ip: partial_vlan_func(ip), StringType())
short.select('source_ip', 'destination_ip', get_vlan_udf('source_ip').alias('source_vlan')).show()
This approach avoids using a udf entirely, leveraging split and slice instead, though there may still be a better way. The benefit of this approach is that it directly exploits the bits present in the subnet mask, and it is written purely in PySpark.
Context for the solution: an IP address can be split into octets and masked by its subnet. The mask values 8, 16, 24, 32 tell you which parts of the IP matter, which motivates dividing the mask by 8 and using the resulting column to slice the IP address, after splitting it from its original StringType into an ArrayType column. (Note that this octet-level trick only lines up cleanly for octet-aligned masks such as /8, /16, /24 and /32.)
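To make the mask-to-octet idea concrete, here is a plain-Python sketch of what the split/slice columns compute below (the helper names are hypothetical):

```python
def masked_octets(cidr):
    """Split e.g. '192.168.1.0/24' into its significant octets.

    The mask divided by 8 gives the number of leading octets that matter;
    this only lines up for octet-aligned masks (/8, /16, /24, /32).
    """
    net, mask = cidr.split("/")
    bits = int(mask) // 8              # e.g. 24 -> 3 significant octets
    return net.split(".")[:bits]       # '192.168.1.0' -> ['192', '168', '1']

def ip_in_network(ip, cidr):
    """Compare the IP's leading octets against the network's masked octets."""
    prefix = masked_octets(cidr)
    return ip.split(".")[:len(prefix)] == prefix
```

The PySpark version below does exactly this, but as column expressions evaluated across the whole DataFrame.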
Note: pyspark.sql.functions.slice works in newer versions of PySpark (>= 2.4); some older versions need f.expr("slice(...)") instead.
Setup:
import pyspark.sql.functions as f

flows = spark.createDataFrame([
    (1, "192.168.1.1", "192.168.2.1"),
    (2, "192.168.2.1", "192.168.3.1"),
    (3, "192.168.3.1", "192.168.1.1"),
], ['id', 'source_ip', 'destination_ip'])

networks = spark.createDataFrame([
    (1, "192.168.1.0/24", "VLAN1"),
    (2, "192.168.2.0/24", "VLAN2"),
    (3, "192.168.3.0/24", "VLAN3"),
], ['id', 'network', 'vlan'])
Some preprocessing:
networks_split = networks.select(
    "*",
    (f.split(f.col("network"), "/")[1] / 8).cast("int").alias("bits"),
    f.split(f.split(f.col("network"), "/")[0], r"\.").alias("segmented_ip")
)
networks_split.show()
+---+--------------+-----+----+----------------+
| id| network| vlan|bits| segmented_ip|
+---+--------------+-----+----+----------------+
| 1|192.168.1.0/24|VLAN1| 3|[192, 168, 1, 0]|
| 2|192.168.2.0/24|VLAN2| 3|[192, 168, 2, 0]|
| 3|192.168.3.0/24|VLAN3| 3|[192, 168, 3, 0]|
+---+--------------+-----+----+----------------+
networks_masked = networks_split.select(
    "*",
    f.expr("slice(segmented_ip, 1, bits)").alias("masked_bits"),
)
networks_masked.show()
+---+--------------+-----+----+----------------+-------------+
| id| network| vlan|bits| segmented_ip| masked_bits|
+---+--------------+-----+----+----------------+-------------+
| 1|192.168.1.0/24|VLAN1| 3|[192, 168, 1, 0]|[192, 168, 1]|
| 2|192.168.2.0/24|VLAN2| 3|[192, 168, 2, 0]|[192, 168, 2]|
| 3|192.168.3.0/24|VLAN3| 3|[192, 168, 3, 0]|[192, 168, 3]|
+---+--------------+-----+----+----------------+-------------+
flows_split = flows.select(
    "*",
    f.split(f.split(f.col("source_ip"), "/")[0], r"\.").alias("segmented_source_ip"),
    f.split(f.split(f.col("destination_ip"), "/")[0], r"\.").alias("segmented_destination_ip")
)
flows_split.show()
+---+-----------+--------------+-------------------+------------------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip|
+---+-----------+--------------+-------------------+------------------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|
+---+-----------+--------------+-------------------+------------------------+
Finally, I crossJoin and filter, comparing the slice against the masked bits, e.g.:
flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_source_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits| masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|VLAN1| 3|[192, 168, 1]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|VLAN2| 3|[192, 168, 2]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|VLAN3| 3|[192, 168, 3]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
Exactly the same approach works for destination_ip, e.g.:
flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_destination_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits| masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|VLAN2| 3|[192, 168, 2]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|VLAN3| 3|[192, 168, 3]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|VLAN1| 3|[192, 168, 1]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
Finally, you can either join the two resulting tables together on source_ip and destination_ip (since each now carries the vlan information you need), or combine the first two steps and crossJoin and filter twice.
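To summarize the whole enrichment in one place, here is a pure-Python sketch that is equivalent in spirit to the two crossJoin/filter passes followed by the final join (the NETWORKS list and helper names are hypothetical):

```python
# Hypothetical stand-in for the networks DataFrame.
NETWORKS = [
    ("192.168.1.0/24", "VLAN1"),
    ("192.168.2.0/24", "VLAN2"),
]

def lookup_vlan(ip, networks=NETWORKS):
    """Return the VLAN whose masked octets prefix the given IP, else None."""
    for cidr, vlan in networks:
        net, mask = cidr.split("/")
        bits = int(mask) // 8
        if ip.split(".")[:bits] == net.split(".")[:bits]:
            return vlan
    return None

def enrich_flows(flows, networks=NETWORKS):
    """Attach source_vlan and destination_vlan to each (source_ip, destination_ip)."""
    return [
        (src, dst, lookup_vlan(src, networks), lookup_vlan(dst, networks))
        for src, dst in flows
    ]
```

In Spark this per-row loop becomes the crossJoin plus slice comparison, which the optimizer can distribute across the cluster.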
Hope this helps!