Pyspark DataFrame Filter column based on a column in another DataFrame without join
Pyspark join dataframe based on function
I have 2 dataframes that look like this:
Networks
+----------------+-------+
| Network | VLAN |
+----------------+-------+
| 192.168.1.0/24 | VLAN1 |
| 192.168.2.0/24 | VLAN2 |
+----------------+-------+
Flows
+--------------+----------------+
| source_ip | destination_ip |
+--------------+----------------+
| 192.168.1.11 | 192.168.2.13 |
+--------------+----------------+
Ideally, I would like to end up with something like this:
+--------------+----------------+-------------+------------------+
| source_ip | destination_ip | source_vlan | destination_vlan |
+--------------+----------------+-------------+------------------+
| 192.168.1.11 | 192.168.2.13 | VLAN1 | VLAN2 |
+--------------+----------------+-------------+------------------+
Unfortunately, the flows dataframe does not contain the subnet mask. What I have tried so far, without pyspark:

```python
ipaddress.ip_network('{}/{}'.format(ip, sub), strict=False)
```
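For reference, the `strict=False` flag is what makes this lookup possible: it masks away the host bits of the address instead of raising a `ValueError` for a host IP:

```python
import ipaddress

# strict=False masks the host bits instead of raising ValueError,
# so a host address plus a mask length yields its network address
net = ipaddress.ip_network("192.168.1.11/24", strict=False)
print(net)  # 192.168.1.0/24
```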
I tried a similar approach with pyspark, but it does not work. I suspect there may be a better way?
```python
def get_available_subnets(df):
    # Extract the mask length ('24') from 'network' values like '192.168.1.0/24'
    split_col = split(df['network'], '/')
    df = df.withColumn('sub', split_col.getItem(1))
    return df.select('sub').distinct()

def get_vlan_by_ip(ip, infoblox, subnets):
    for sub in subnets:
        net = ipaddress.ip_network('{}/{}'.format(ip, sub), strict=False)
        search = infoblox.filter(infoblox.network == str(net))
        if search.head(1):  # head(1) returns a (possibly empty) list
            return search.first().vlan  # first() is a method, not an attribute
    return hashlib.sha1(str.encode(ip)).hexdigest()
```
```python
subnets = get_available_subnets(infoblox_networks_df).select('sub').rdd.flatMap(lambda x: x).collect()

short = flows_filtered_prepared_df.limit(1000)

partial_vlan_func = partial(get_vlan_by_ip, infoblox=infoblox_networks_df, subnets=subnets)
get_vlan_udf = udf(lambda ip: partial_vlan_func(ip), StringType())

short.select('source_ip', 'destination_ip', get_vlan_udf('source_ip').alias('source_vlan')).show()
```
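One likely reason the attempt above fails: a Spark DataFrame (here `infoblox`) cannot be referenced from inside a UDF, which executes on the workers. If the networks table is small, a common workaround is to collect it to the driver and close over plain Python data. A sketch, not from the original post; `subnet_to_vlan` stands in for the collected table:

```python
import ipaddress

# Hypothetical: the networks table collected to the driver as a plain dict,
# e.g. {row.network: row.vlan for row in infoblox_networks_df.collect()}
subnet_to_vlan = {"192.168.1.0/24": "VLAN1", "192.168.2.0/24": "VLAN2"}

def lookup_vlan(ip):
    # Pure-Python lookup: safe to wrap in a udf, since it touches no DataFrame
    for net, vlan in subnet_to_vlan.items():
        if ipaddress.ip_address(ip) in ipaddress.ip_network(net):
            return vlan
    return None  # the original attempt returned a sha1 fallback here

print(lookup_vlan("192.168.1.11"))  # VLAN1
```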
This approach avoids using a udf entirely, leveraging split and slice, though perhaps there is a better way. The benefit of this approach is that it directly exploits the bits present in the subnet mask, and it is written purely in PySpark.
Context for the solution: an IP address can be split into octets and masked by its subnet. This means the mask lengths 8, 16, 24, 32 tell you which parts of the IP matter, which motivates dividing by 8 and using the resulting column to slice the IP address's ArrayType column (split from its original StringType).
Note: pyspark.sql.functions.slice works in newer versions of PySpark (>= 2.4); some older versions need to use f.expr("slice(...)") instead.
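For intuition, here is a plain-Python analogue (not PySpark) of the masking idea: a /24 prefix fixes 24 // 8 = 3 octets, and Spark SQL's 1-based `slice(arr, 1, n)` corresponds to `arr[:n]` on a Python list:

```python
# A /24 network fixes the first 24 bits, i.e. the first 24 // 8 = 3 octets
bits = 24 // 8
segmented_ip = "192.168.1.11".split(".")  # ['192', '168', '1', '11']
masked = segmented_ip[:bits]              # Spark SQL: slice(segmented_ip, 1, bits)
print(masked)  # ['192', '168', '1']
```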
Setup:
```python
import pyspark.sql.functions as f

flows = spark.createDataFrame([
    (1, "192.168.1.1", "192.168.2.1"),
    (2, "192.168.2.1", "192.168.3.1"),
    (3, "192.168.3.1", "192.168.1.1"),
], ['id', 'source_ip', 'destination_ip'])

networks = spark.createDataFrame([
    (1, "192.168.1.0/24", "VLAN1"),
    (2, "192.168.2.0/24", "VLAN2"),
    (3, "192.168.3.0/24", "VLAN3"),
], ['id', 'network', 'vlan'])
```
Some preprocessing:
```python
networks_split = networks.select(
    "*",
    (f.split(f.col("network"), "/")[1] / 8).cast("int").alias("bits"),
    f.split(f.split(f.col("network"), "/")[0], r"\.").alias("segmented_ip")
)
networks_split.show()
```

```
+---+--------------+-----+----+----------------+
| id|       network| vlan|bits|    segmented_ip|
+---+--------------+-----+----+----------------+
|  1|192.168.1.0/24|VLAN1|   3|[192, 168, 1, 0]|
|  2|192.168.2.0/24|VLAN2|   3|[192, 168, 2, 0]|
|  3|192.168.3.0/24|VLAN3|   3|[192, 168, 3, 0]|
+---+--------------+-----+----+----------------+
```
```python
networks_masked = networks_split.select(
    "*",
    f.expr("slice(segmented_ip, 1, bits)").alias("masked_bits"),
)
networks_masked.show()
```

```
+---+--------------+-----+----+----------------+-------------+
| id|       network| vlan|bits|    segmented_ip|  masked_bits|
+---+--------------+-----+----+----------------+-------------+
|  1|192.168.1.0/24|VLAN1|   3|[192, 168, 1, 0]|[192, 168, 1]|
|  2|192.168.2.0/24|VLAN2|   3|[192, 168, 2, 0]|[192, 168, 2]|
|  3|192.168.3.0/24|VLAN3|   3|[192, 168, 3, 0]|[192, 168, 3]|
+---+--------------+-----+----+----------------+-------------+
```
```python
flows_split = flows.select(
    "*",
    f.split(f.split(f.col("source_ip"), "/")[0], r"\.").alias("segmented_source_ip"),
    f.split(f.split(f.col("destination_ip"), "/")[0], r"\.").alias("segmented_destination_ip")
)
flows_split.show()
```

```
+---+-----------+--------------+-------------------+------------------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip|
+---+-----------+--------------+-------------------+------------------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|
+---+-----------+--------------+-------------------+------------------------+
```
Finally, I crossJoin and filter the slices against the masked bits, e.g.:
```python
flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_source_ip, 1, bits)") == f.col("masked_bits")
).show()
```

```
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits|  masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|VLAN1|   3|[192, 168, 1]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|VLAN2|   3|[192, 168, 2]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|VLAN3|   3|[192, 168, 3]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
```
The exact same approach can be applied to destination_ip, e.g.:
```python
flows_split.crossJoin(
    networks_masked.select("vlan", "bits", "masked_bits")
).where(
    f.expr("slice(segmented_destination_ip, 1, bits)") == f.col("masked_bits")
).show()
```

```
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id|  source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits|  masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
|  1|192.168.1.1|   192.168.2.1|   [192, 168, 1, 1]|        [192, 168, 2, 1]|VLAN2|   3|[192, 168, 2]|
|  2|192.168.2.1|   192.168.3.1|   [192, 168, 2, 1]|        [192, 168, 3, 1]|VLAN3|   3|[192, 168, 3]|
|  3|192.168.3.1|   192.168.1.1|   [192, 168, 3, 1]|        [192, 168, 1, 1]|VLAN1|   3|[192, 168, 1]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
```
Finally, you can join the two resulting tables together on source_ip and destination_ip (since each now carries the vlan information you need), or you can merge the previous two steps and crossJoin and filter twice.
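As a sanity check (not part of the original answer), the expected end result for the question's sample data can be reproduced in plain Python with the stdlib ipaddress module:

```python
import ipaddress

# Toy copies of the two tables from the question
networks = {"192.168.1.0/24": "VLAN1", "192.168.2.0/24": "VLAN2"}
flows = [("192.168.1.11", "192.168.2.13")]

def vlan_of(ip):
    # Linear scan over subnets: fine for a check, not for large data
    for net, vlan in networks.items():
        if ipaddress.ip_address(ip) in ipaddress.ip_network(net):
            return vlan
    return None

result = [(src, dst, vlan_of(src), vlan_of(dst)) for src, dst in flows]
print(result)  # [('192.168.1.11', '192.168.2.13', 'VLAN1', 'VLAN2')]
```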
Hope this helps!