Is there a way to filter a field not containing something in a Spark DataFrame using Scala?
Hopefully I'm stupid and this will be easy.
I have a dataframe containing the columns 'url' and 'referrer'.
I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.
I can use
val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))
However, this also pulls out www.google.co.uk search URLs, because they happen to contain my web domain as well. Is there a way, using Scala in Spark, that I can filter out anything with google in it while keeping the correct results I have?
Thanks
Dean
You can negate a predicate using either not or !, so all that's left is to add another condition:
import org.apache.spark.sql.functions.not
df.where($"referrer".contains("www.mydomain.") &&
not($"referrer".contains("google")))
or a separate filter:
df
.where($"referrer".contains("www.mydomain."))
.where(!$"referrer".contains("google"))
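To try the combined condition end to end, here is a minimal sketch that can be pasted into spark-shell. It assumes spark-sql is on the classpath; the sample rows and the local session settings are made up for illustration, and only the column names 'url' and 'referrer' come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

// Local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("referrer-filter").getOrCreate()
import spark.implicits._

// Hypothetical sample data, not from the original question.
val unfilteredDf = Seq(
  ("/home", "http://www.mydomain.com/products"),
  ("/home", "http://www.mydomain.co/about"),
  ("/home", "https://www.google.co.uk/search?q=www.mydomain.com"),
  ("/home", "http://www.example.org/")
).toDF("url", "referrer")

// Keep referrers that mention the domain but do not mention google.
val filteredDf = unfilteredDf.where(
  $"referrer".contains("www.mydomain.") && not($"referrer".contains("google"))
)

// Collect the surviving referrers so they can be inspected.
val kept = filteredDf.select("referrer").as[String].collect().toSet
```

Rows with a null referrer are dropped as well, since the condition evaluates to null for them rather than true.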
You may use a regex. Here you can find a reference for the usage of regex in Scala. And here you can find some hints about how to create a proper regex for URLs.
Thus in your case you will have something like:
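// A Scala Regex cannot be applied to a Column, so pass the pattern string to Column.rlike instead.
// rlike keeps a row when the pattern is found anywhere in the value, so anchor it with ^ to match from the start.
val pattern = "PUT_YOUR_REGEX_HERE" // something like ^(https?|ftp)://www\.mydomain\.com?(/[^\s]*)? should work
val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))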
This solution requires a bit of work but is the safest one.
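A candidate pattern can be checked on plain strings before it goes anywhere near Spark. The sketch below uses made-up referrers and one possible pattern (neither is from the original posts) to show why anchoring matters: without ^, the Google search URL still matches because the domain appears in its query string. As far as I know, Column.rlike uses the same Java regex engine with "find" semantics, so the comparison should carry over.

```scala
import java.util.regex.Pattern

// Hypothetical referrers for illustration only.
val referrers = Seq(
  "http://www.mydomain.com/products",
  "http://www.mydomain.co/about",
  "https://www.google.co.uk/search?q=http://www.mydomain.com"
)

// Unanchored: may match anywhere inside the string.
val loose = Pattern.compile("""(https?|ftp)://www\.mydomain\.com?(/\S*)?""")
// Anchored with ^: the referrer itself must start with the domain.
val strict = Pattern.compile("""^(https?|ftp)://www\.mydomain\.com?(/\S*)?""")

// find() succeeds when the pattern occurs somewhere in the input.
val looseHits = referrers.filter(r => loose.matcher(r).find())
val strictHits = referrers.filter(r => strict.matcher(r).find())
```

Here looseHits keeps all three referrers while strictHits keeps only the first two. The trailing com? is what lets the same pattern accept both the .com and the .co domain.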