
Check if an IP address is in an IP network with Pyspark

With Pyspark, I would like to join/merge dataframe A with dataframe B when an IP address in A falls within an IP network range in B, or exactly matches an IP address in B.

Dataframe A contains plain IP addresses only, while dataframe B has plain IP addresses as well as IP addresses with a CIDR suffix. Here is an example.

Dataframe A
+---------------+
|     ip_address|
+---------------+
|      192.0.2.2|
|   164.42.155.5|
|    52.95.245.0|
|  66.42.224.235|
|            ...|
+---------------+

Dataframe B
+---------------+
|     ip_address|
+---------------+
| 123.122.213.34|
|    41.32.241.2|
|  66.42.224.235|
|   192.0.2.0/23|
|            ...|
+---------------+

Then the expected output is something like below:

+---------------+--------+
|     ip_address| is_in_b|
+---------------+--------+
|      192.0.2.2|    true|  -> This is in the same network range as 192.0.2.0/23
|   164.42.155.5|   false|
|    52.95.245.0|   false|
|  66.42.224.235|    true|  -> This is in B
|            ...|     ...|
+---------------+--------+

The idea I first wanted to try is using a udf that compares rows one by one and checks the IP range whenever a CIDR comes up, but it seems udfs can't take multiple dataframes. I also tried to convert dataframe B to a list and then compare, but that is very inefficient and takes a long time, since the number of rows in A times the number of rows in B is over 100 million. Is there any efficient solution?
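For reference, that list-based attempt looked roughly like the sketch below (df_a, df_b, and check are hypothetical names; is_in_ip_network is the plain-Python helper shown in the edit that follows):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# Collect all of B to the driver (this is the part that does not scale):
b_list = [row.ip_address for row in df_b.collect()]

# Compare each address in A against every entry of B, one by one.
check = udf(lambda ip: any(is_in_ip_network(ip, n) for n in b_list), BooleanType())
df_a.withColumn("is_in_b", check("ip_address"))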

Edited: For more detailed information, I used the following code to check the membership without Pyspark or any library.

def cidr_to_netmask(c):
    # Convert a CIDR prefix length (e.g. '23') to a dotted-decimal netmask.
    cidr = int(c)
    mask = (0xffffffff >> (32 - cidr)) << (32 - cidr)

    return (str((0xff000000 & mask) >> 24) + '.' +
            str((0x00ff0000 & mask) >> 16) + '.' +
            str((0x0000ff00 & mask) >> 8) + '.' +
            str(0x000000ff & mask))

def ip_to_numeric(ip):
    # Convert a dotted-decimal IP address to its 32-bit integer value.
    ip_num = 0
    for i, octet in enumerate(ip.split('.')):
        ip_num += int(octet) << (24 - (8 * i))

    return ip_num

def is_in_ip_network(ip, network_addr):
    # No CIDR suffix: fall back to an exact string comparison.
    if len(network_addr.split('/')) < 2:
        return ip == network_addr.split('/')[0]
    # CIDR suffix present: compare the network parts under the netmask.
    else:
        network_ip, cidr = network_addr.split('/')
        subnet = cidr_to_netmask(cidr)
        return (ip_to_numeric(ip) & ip_to_numeric(subnet)) == (ip_to_numeric(network_ip) & ip_to_numeric(subnet))
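A quick sanity check of these helpers against the example above:

print(cidr_to_netmask('23'))                               # 255.255.254.0
print(is_in_ip_network('192.0.2.2', '192.0.2.0/23'))       # True
print(is_in_ip_network('164.42.155.5', '192.0.2.0/23'))    # False
print(is_in_ip_network('66.42.224.235', '66.42.224.235'))  # True (exact match)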

You could use crossJoin and a udf, but at the cost of a Cartesian product:

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# `spark` is the SparkSession (created automatically in the pyspark shell).
data_1 = ["192.0.2.2", "164.42.155.5", "52.95.245.0", "66.42.224.235"]
data_2 = ["192.0.2.0/23", "66.42.224.235"]
DF1 = spark.createDataFrame([Row(ip=x) for x in data_1])
DF2 = spark.createDataFrame([Row(ip=x) for x in data_2])

# Wrap the plain-Python check from the question as a boolean udf.
join_cond = udf(is_in_ip_network, BooleanType())

DF1.crossJoin(DF2).withColumn("match", join_cond(DF1.ip, DF2.ip)).show()

The result looks similar to:

+-------------+-------------+-----+
|           ip|           ip|match|
+-------------+-------------+-----+
|    192.0.2.2| 192.0.2.0/23| true|
|    192.0.2.2|66.42.224.235|false|
| 164.42.155.5| 192.0.2.0/23|false|
| 164.42.155.5|66.42.224.235|false|
|  52.95.245.0| 192.0.2.0/23|false|
|  52.95.245.0|66.42.224.235|false|
|66.42.224.235| 192.0.2.0/23|false|
|66.42.224.235|66.42.224.235| true|
+-------------+-------------+-----+
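To collapse that pairwise result into the per-address is_in_b column from the expected output, one option (a sketch on top of the answer above; the column name ip_b is introduced here to avoid the ambiguous duplicate ip) is to aggregate the matches per address of A:

from pyspark.sql import functions as F

result = (DF1.crossJoin(DF2.withColumnRenamed("ip", "ip_b"))
             .withColumn("match", join_cond("ip", "ip_b"))
             .groupBy("ip")
             .agg((F.sum(F.col("match").cast("int")) > 0).alias("is_in_b")))
result.show()

If DF2 is small, wrapping it in F.broadcast(...) can keep the cross join from shuffling the large side, though the number of comparisons is still |A| x |B|.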
