PySpark: join dataframes based on a function
I have 2 dataframes that look like this:
networks
+----------------+-------+
| Network | VLAN |
+----------------+-------+
| 192.168.1.0/24 | VLAN1 |
| 192.168.2.0/24 | VLAN2 |
+----------------+-------+
flows
+--------------+----------------+
| source_ip | destination_ip |
+--------------+----------------+
| 192.168.1.11 | 192.168.2.13 |
+--------------+----------------+
Ideally I would like to get something like this:
+--------------+----------------+-------------+------------------+
| source_ip | destination_ip | source_vlan | destination_vlan |
+--------------+----------------+-------------+------------------+
| 192.168.1.11 | 192.168.2.13 | VLAN1 | VLAN2 |
+--------------+----------------+-------------+------------------+
Unfortunately the flows dataframe does not contain the subnet mask. What I have tried so far without PySpark:
ipaddress.ip_network('{}/{}'.format(ip, sub), strict=False)
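For reference, the plain-Python matching logic that line implies looks something like the sketch below. This uses only the standard library; the `networks` dict and the `vlan_for_ip` helper are illustrative stand-ins, not part of the original code:

```python
import ipaddress

# Plain-Python stand-in for the networks dataframe.
networks = {
    "192.168.1.0/24": "VLAN1",
    "192.168.2.0/24": "VLAN2",
}

def vlan_for_ip(ip):
    # ip_network(..., strict=False) masks away the host bits, so an IP
    # plus a candidate prefix length collapses to the network it sits in.
    for net, vlan in networks.items():
        prefix = net.split("/")[1]
        candidate = ipaddress.ip_network("{}/{}".format(ip, prefix), strict=False)
        if str(candidate) == net:
            return vlan
    return None

print(vlan_for_ip("192.168.1.11"))  # VLAN1
```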
I tried a similar approach with PySpark, but it does not work well; I think there might be better ways of doing it.
import hashlib
import ipaddress

from pyspark.sql.functions import split

def get_available_subnets(df):
    split_col = split(df['network'], '/')
    df = df.withColumn('sub', split_col.getItem(1))
    return df.select('sub').distinct()

def get_vlan_by_ip(ip, infoblox, subnets):
    for sub in subnets:
        net = ipaddress.ip_network('{}/{}'.format(ip, sub), strict=False)
        search = infoblox.filter(infoblox.network == str(net))
        if len(search.head(1)) > 0:       # head(1) returns a list of Rows
            return search.first().vlan    # first is a method, not an attribute
    # fall back to a stable hash when no subnet matches
    return hashlib.sha1(str.encode(ip)).hexdigest()
subnets = get_available_subnets(infoblox_networks_df).select('sub').rdd.flatMap(lambda x: x).collect()
short = flows_filtered_prepared_df.limit(1000)
partial_vlan_func = partial(get_vlan_by_ip, infoblox=infoblox_networks_df, subnets=subnets)
get_vlan_udf = udf(lambda ip: partial_vlan_func(ip), StringType())
short.select('source_ip', 'destination_ip', get_vlan_udf('source_ip').alias('source_vlan')).show()
This method completely avoids the use of udf, leveraging split and slice, but perhaps there is a better way. The benefit of this approach is that it directly leverages the bits present in the subnet mask and that it's written purely in PySpark.
Context for the solution: IP addresses can be split and masked by the subnet. This means that 8, 16, 24, 32 tell you which parts of the IP matter - this motivates the division by 8 and using the resulting column to slice the IP address ArrayType column once it's split from its original StringType.
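Concretely, a /24 mask covers 24 / 8 = 3 full octets, so comparing the first 3 octets of an IP against the masked network decides membership. A minimal plain-Python illustration (the `octets_match` helper is hypothetical, and, like the slice-based join below, it only works for prefixes that are multiples of 8):

```python
def octets_match(ip, network_cidr):
    # Compare only the first prefix // 8 octets of the IP and the network.
    net, prefix = network_cidr.split("/")
    bits = int(prefix) // 8            # /24 -> 3 octets
    return ip.split(".")[:bits] == net.split(".")[:bits]

print(octets_match("192.168.1.11", "192.168.1.0/24"))  # True
print(octets_match("192.168.2.13", "192.168.1.0/24"))  # False
```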
NB: pyspark.sql.functions.slice will work in newer versions of PySpark >= 2.4; some older ones need to use f.expr("slice(...)").
The setup:
flows = spark.createDataFrame([
(1, "192.168.1.1", "192.168.2.1"),
(2, "192.168.2.1", "192.168.3.1"),
(3, "192.168.3.1", "192.168.1.1"),
], ['id', 'source_ip', 'destination_ip']
)
networks = spark.createDataFrame([
(1, "192.168.1.0/24", "VLAN1"),
(2, "192.168.2.0/24", "VLAN2"),
(3, "192.168.3.0/24", "VLAN3"),
], ['id', 'network', 'vlan']
)
Some pre-processing:
networks_split = networks.select(
    "*",
    (f.split(f.col("network"), "/")[1] / 8).cast("int").alias("bits"),
    f.split(f.split(f.col("network"), "/")[0], r"\.").alias('segmented_ip')
)
networks_split.show()
+---+--------------+-----+----+----------------+
| id| network| vlan|bits| segmented_ip|
+---+--------------+-----+----+----------------+
| 1|192.168.1.0/24|VLAN1| 3|[192, 168, 1, 0]|
| 2|192.168.2.0/24|VLAN2| 3|[192, 168, 2, 0]|
| 3|192.168.3.0/24|VLAN3| 3|[192, 168, 3, 0]|
+---+--------------+-----+----+----------------+
networks_masked = networks_split.select(
"*",
f.expr("slice(segmented_ip, 1, bits)").alias("masked_bits"),
)
networks_masked.show()
+---+--------------+-----+----+----------------+-------------+
| id| network| vlan|bits| segmented_ip| masked_bits|
+---+--------------+-----+----+----------------+-------------+
| 1|192.168.1.0/24|VLAN1| 3|[192, 168, 1, 0]|[192, 168, 1]|
| 2|192.168.2.0/24|VLAN2| 3|[192, 168, 2, 0]|[192, 168, 2]|
| 3|192.168.3.0/24|VLAN3| 3|[192, 168, 3, 0]|[192, 168, 3]|
+---+--------------+-----+----+----------------+-------------+
flows_split = flows.select(
    "*",
    f.split(f.split(f.col("source_ip"), "/")[0], r"\.").alias('segmented_source_ip'),
    f.split(f.split(f.col("destination_ip"), "/")[0], r"\.").alias('segmented_destination_ip')
)
flows_split.show()
+---+-----------+--------------+-------------------+------------------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip|
+---+-----------+--------------+-------------------+------------------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|
+---+-----------+--------------+-------------------+------------------------+
Finally, I crossJoin and filter on the slice based on the bits of my mask, such as:
flows_split.crossJoin(
networks_masked.select("vlan", "bits", "masked_bits")
).where(
f.expr("slice(segmented_source_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits| masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|VLAN1| 3|[192, 168, 1]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|VLAN2| 3|[192, 168, 2]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|VLAN3| 3|[192, 168, 3]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
Exactly the same approach can be done for destination_ip, such as:
flows_split.crossJoin(
networks_masked.select("vlan", "bits", "masked_bits")
).where(
f.expr("slice(segmented_destination_ip, 1, bits)") == f.col("masked_bits")
).show()
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| id| source_ip|destination_ip|segmented_source_ip|segmented_destination_ip| vlan|bits| masked_bits|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
| 1|192.168.1.1| 192.168.2.1| [192, 168, 1, 1]| [192, 168, 2, 1]|VLAN2| 3|[192, 168, 2]|
| 2|192.168.2.1| 192.168.3.1| [192, 168, 2, 1]| [192, 168, 3, 1]|VLAN3| 3|[192, 168, 3]|
| 3|192.168.3.1| 192.168.1.1| [192, 168, 3, 1]| [192, 168, 1, 1]|VLAN1| 3|[192, 168, 1]|
+---+-----------+--------------+-------------------+------------------------+-----+----+-------------+
Finally, you either join the resulting two tables together on source_ip and destination_ip (since you have the vlan information attached as required), or you merge the previous two steps together and crossJoin and filter twice.
Hope this helps!