
Is there a way to filter a field not containing something in a Spark dataframe using Scala?

Hopefully I'm being stupid and this will be easy.

I have a dataframe containing the columns 'url' and 'referrer'.

I want to extract all the referrers that contain my site's domain, either 'www.mydomain.com' or 'mydomain.co'.

I can use:

val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))

However, for some reason this also pulls out www.google.co.uk search URLs that contain my web domain. Is there a way, using Scala in Spark, to filter out anything with google in it while keeping the correct results I have?

Thanks

Dean

You can negate a predicate using either not or !, so all that's left is to add another condition:

import org.apache.spark.sql.functions.not

df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))

or use a separate filter:

df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
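
For anyone who wants to try this end to end, here is a minimal sketch that can be pasted into spark-shell or run as a script with Spark on the classpath. The sample rows, the app name and the local master setting are assumptions made up for illustration; only the two where calls come from the answer above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .appName("referrer-filter-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with the two columns from the question.
val unfilteredDf = Seq(
  ("/home", "http://www.mydomain.com/products"),
  ("/home", "http://www.mydomain.co/about"),
  ("/home", "http://www.google.co.uk/search?q=www.mydomain.com"),
  ("/home", "http://www.example.org/")
).toDF("url", "referrer")

// Keep referrers that mention the domain, then drop the ones that mention google.
val filteredDf = unfilteredDf
  .where($"referrer".contains("www.mydomain."))
  .where(!$"referrer".contains("google"))

filteredDf.show(false)
```

With this sample data only the two mydomain referrers should remain; the Google search row passes the first condition but is dropped by the negated one.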

You may use a Regex. Here you can find a reference for the usage of regex in Scala, and here you can find some hints about how to create a proper regex for URLs.

Thus in your case you will have something like:

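// Column.rlike evaluates the (Java) regex inside Spark; a scala.util.matching.Regex cannot be applied to a Column directly.
// rlike succeeds if the pattern is found anywhere in the value, so anchor it with ^ to match from the start of the referrer.
val pattern = "PUT_YOUR_REGEX_HERE" // something like ^(https?|ftp)://www\.mydomain\.com?(/\S*)?$ should work (use a triple-quoted string to keep the backslashes)
val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))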

This solution requires a bit of work but is the safest one.
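
To make the regex idea concrete, here is a small plain-Scala sketch that checks a candidate pattern against a few referrer strings before it goes anywhere near Spark. The pattern, the helper name isOwnDomain and the sample URLs are all assumptions for illustration, not something taken from the question's data.

```scala
// Hypothetical pattern: scheme, then www.mydomain.co or www.mydomain.com,
// then an optional path, anchored at both ends so the domain must be the host.
val referrerPattern = """^(https?|ftp)://www\.mydomain\.com?(/\S*)?$""".r

// Hypothetical helper: true when the whole referrer matches the pattern.
def isOwnDomain(referrer: String): Boolean =
  referrerPattern.findFirstIn(referrer).isDefined

// Hypothetical referrers to try the pattern on.
val samples = Seq(
  "http://www.mydomain.com/products",
  "http://www.mydomain.co/about",
  "http://www.google.co.uk/search?q=www.mydomain.com"
)

samples.foreach(r => println(s"$r -> ${isOwnDomain(r)}"))
```

Because the pattern is anchored with ^, a Google search URL that merely contains the domain in its query string does not match. The same pattern string (referrerPattern.regex) could then be passed to rlike on the referrer column.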
