
Spark Udf function with Dataframe in input

I have to develop a Spark script with Python that checks some logs and verifies whether a user has changed the country of his IP between two events. I have a csv file with IP ranges and associated countries saved on HDFS like this:

startIp, endIp, country
0.0.0.0, 10.0.0.0, Italy
10.0.0.1, 20.0.0.0, England
20.0.0.1, 30.0.0.0, Germany

And a log csv file:

userId, timestamp, ip, event
1, 02-01-17 20:45:18, 10.5.10.3, login
24, 02-01-17 20:46:34, 54.23.16.56, login

I load both files into Spark Dataframes, and I've already modified the one that contains the logs with a lag function, adding a column with the previousIp. The solution I thought of is to substitute ip and previousIp with the associated countries in order to compare them using a dataFrame.filter("previousIp" != "ip"). My question is: is there a way to do that in Spark? Something like:

dataFrame = dataFrame.select("userId", udfConvert("ip",countryDataFrame).alias("ip"), udfConvert("previousIp",countryDataFrame).alias("previousIp"),...)

In order to have a Dataframe like this:

userId, timestamp, ip, event, previousIp
1, 02-01-17 20:45:18, England, login, Italy

If not, how can I solve my problem? Thank you.
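For reference, the lag step I mentioned above looks roughly like this (the dataframe name logs and the window spec are my own, based on the samples):

from pyspark.sql import Window
from pyspark.sql.functions import lag

# One window per user, ordered by event time, so lag() picks the previous row's ip
w = Window.partitionBy('userId').orderBy('timestamp')
logs = logs.withColumn('previousIp', lag('ip').over(w))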

It's actually quite easy if you convert the IP addresses to numbers first. You can write your own UDF or use the code from petrabarus and register the function like this:

spark.sql("CREATE TEMPORARY FUNCTION iptolong as 'net.petrabarus.hiveudfs.IPToLong'")
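If you prefer not to pull in the Hive UDF jar, a minimal pure-Python alternative might look like this (a sketch; ip_to_long is my own helper, not from petrabarus, but it produces the same numbers shown below):

from pyspark.sql.types import LongType

def ip_to_long(ip):
    # Dotted quad to 32-bit integer, e.g. 10.0.0.0 -> 10 * 256**3 = 167772160
    a, b, c, d = (int(p) for p in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

spark.udf.register('iptolong', ip_to_long, LongType())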

Then map the countries csv to a dataframe with numbers:

>>> from pyspark.sql.functions import expr, broadcast
>>> ipdb = spark.read.csv('ipdb.csv', header=True).select(
             expr('iptolong(startIp)').alias('ip_from'),
             expr('iptolong(endIp)').alias('ip_to'),
             'country')
>>> ipdb.show()
+---------+---------+-------+
|  ip_from|    ip_to|country|
+---------+---------+-------+
|        0|167772160|  Italy|
|167772161|335544320|England|
|335544321|503316480|Germany|
+---------+---------+-------+

Also, map your log dataframe to numbers:

>>> log = spark.createDataFrame([('15.0.0.1',)], ['ip']) \
            .withColumn('ip', expr('iptolong(ip)'))
>>> log.show()
+---------+
|       ip|
+---------+
|251658241|
+---------+

Then you can join this dataframe using a between condition:

>>> log.join(broadcast(ipdb), log.ip.between(ipdb.ip_from, ipdb.ip_to)).show()
+---------+---------+---------+-------+
|       ip|  ip_from|    ip_to|country|
+---------+---------+---------+-------+
|251658241|167772161|335544320|England|
+---------+---------+---------+-------+
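To close the loop on the original question, the same join can be applied twice, once per IP column, and the rows where the country changed filtered out. A sketch, assuming your log dataframe has userId, timestamp, event, and both ip and previousIp already converted to numbers:

from pyspark.sql.functions import broadcast, col

# Alias the country table so the two joins don't produce ambiguous column names
a = ipdb.alias('a')
b = ipdb.alias('b')

changed = (log
    .join(broadcast(a), log.ip.between(col('a.ip_from'), col('a.ip_to')))
    .join(broadcast(b), log.previousIp.between(col('b.ip_from'), col('b.ip_to')))
    .select('userId', 'timestamp', 'event',
            col('a.country').alias('country'),
            col('b.country').alias('previousCountry'))
    .filter(col('country') != col('previousCountry')))

broadcast() ships the small country table to every executor, so the range join avoids shuffling the (much larger) log dataframe.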

