
Spark Udf function with Dataframe in input

I have to develop a Spark script with Python that checks some logs and verifies whether a user has changed the country of his IP between two events. I have a csv file with IP ranges and associated countries saved on HDFS like this:

startIp, endIp, country
0.0.0.0, 10.0.0.0, Italy
10.0.0.1, 20.0.0.0, England
20.0.0.1, 30.0.0.0, Germany

And a log csv file:

userId, timestamp, ip, event
1, 02-01-17 20:45:18, 10.5.10.3, login
24, 02-01-17 20:46:34, 54.23.16.56, login

I load both files into Spark Dataframes, and I've already modified the one that contains the logs with a lag function, adding a column with the previousIp. The solution I thought of is to substitute ip and previousIp with the associated countries in order to compare them using a dataFrame.filter("previousIp" != "ip"). My question is: is there a way to do that in Spark? Something like:

dataFrame = dataFrame.select("userId", udfConvert("ip",countryDataFrame).alias("ip"), udfConvert("previousIp",countryDataFrame).alias("previousIp"),...)

In order to have a Dataframe like this:

userId, timestamp, ip, event, previousIp
1, 02-01-17 20:45:18, England, login, Italy

If not, how can I solve my problem? Thank you.
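For reference, the lag step I mentioned above looks roughly like this (the dataframe name logs and the window spec are my own, based on the samples):

from pyspark.sql import Window
from pyspark.sql.functions import lag

# One window per user, ordered by event time, so lag() picks the previous row's ip
w = Window.partitionBy('userId').orderBy('timestamp')
logs = logs.withColumn('previousIp', lag('ip').over(w))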

It's actually quite easy if you convert the IP addresses to numbers first. You can write your own UDF or use the code from petrabarus and register the function like this:

spark.sql("CREATE TEMPORARY FUNCTION iptolong as 'net.petrabarus.hiveudfs.IPToLong'")
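If you prefer not to pull in the Hive UDF jar, a minimal pure-Python alternative might look like this (a sketch; ip_to_long is my own helper, not from petrabarus, but it produces the same numbers shown below):

from pyspark.sql.types import LongType

def ip_to_long(ip):
    # Dotted quad to 32-bit integer, e.g. 10.0.0.0 -> 10 * 256**3 = 167772160
    a, b, c, d = (int(p) for p in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

spark.udf.register('iptolong', ip_to_long, LongType())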

Then map the countries csv to a dataframe with numbers:

>>> from pyspark.sql.functions import expr, broadcast
>>> ipdb = spark.read.csv('ipdb.csv', header=True).select(
             expr('iptolong(startIp)').alias('ip_from'),
             expr('iptolong(endIp)').alias('ip_to'),
             'country')
>>> ipdb.show()
+---------+---------+-------+
|  ip_from|    ip_to|country|
+---------+---------+-------+
|        0|167772160|  Italy|
|167772161|335544320|England|
|335544321|503316480|Germany|
+---------+---------+-------+

Also, map your log dataframe to numbers:

>>> log = spark.createDataFrame([('15.0.0.1',)], ['ip']) \
            .withColumn('ip', expr('iptolong(ip)'))
>>> log.show()
+---------+
|       ip|
+---------+
|251658241|
+---------+

Then you can join this dataframe using a between condition:

>>> log.join(broadcast(ipdb), log.ip.between(ipdb.ip_from, ipdb.ip_to)).show()
+---------+---------+---------+-------+
|       ip|  ip_from|    ip_to|country|
+---------+---------+---------+-------+
|251658241|167772161|335544320|England|
+---------+---------+---------+-------+
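To close the loop on the original question, the same join can be applied twice, once per IP column, and the rows where the country changed filtered out. A sketch, assuming your log dataframe has userId, timestamp, event, and both ip and previousIp already converted to numbers:

from pyspark.sql.functions import broadcast, col

# Alias the country table so the two joins don't produce ambiguous column names
a = ipdb.alias('a')
b = ipdb.alias('b')

changed = (log
    .join(broadcast(a), log.ip.between(col('a.ip_from'), col('a.ip_to')))
    .join(broadcast(b), log.previousIp.between(col('b.ip_from'), col('b.ip_to')))
    .select('userId', 'timestamp', 'event',
            col('a.country').alias('country'),
            col('b.country').alias('previousCountry'))
    .filter(col('country') != col('previousCountry')))

broadcast() ships the small country table to every executor, so the range join avoids shuffling the (much larger) log dataframe.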

