Is there a way, using Scala, to filter out fields in a Spark DataFrame that contain certain content?

Question

Hopefully I'm stupid and this will be easy.

I have a DataFrame containing the columns 'url' and 'referrer'.

I want to extract all the referrers that contain the top-level domain 'www.mydomain.com' and 'mydomain.co'.

I can use

val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))

However, this also pulls out www.google.co.uk search URLs, which for some reason contain my web domain as well. Is there a way, using Scala in Spark, to filter out anything with google in it while keeping the correct results I have?

Thanks

Dean

Answer 1

You can negate a predicate using either not or !, so all that's left is to add another condition:

import org.apache.spark.sql.functions.not

df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))

or as a separate filter:

df
 .where($"referrer".contains("www.mydomain."))
 .where(!$"referrer".contains("google"))
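For anyone who wants to try this end to end, here is a minimal sketch of the first variant. It assumes Spark 2.x or later (SparkSession) running in local mode, and the sample rows, the object name ReferrerFilterSketch and the helper ownReferrers are made up for illustration; none of them come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

object ReferrerFilterSketch {
  // Keep rows whose referrer mentions the site but does not mention google.
  def ownReferrers(df: DataFrame): DataFrame =
    df.where(
      col("referrer").contains("www.mydomain.") &&
        not(col("referrer").contains("google"))
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark 2.x or later).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("referrer-filter-sketch")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a local Seq

    // Hypothetical sample rows; the real data is not shown in the question.
    val unfilteredDf = Seq(
      ("http://www.mydomain.com/a", "http://www.mydomain.com/home"),
      ("http://www.mydomain.com/b", "https://www.google.co.uk/search?q=www.mydomain.com"),
      ("http://www.mydomain.com/c", "http://www.example.org/")
    ).toDF("url", "referrer")

    ownReferrers(unfilteredDf).show(false)
    spark.stop()
  }
}
```

With these sample rows only the first one should remain: the second is dropped by the google condition and the third never matched the domain in the first place.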

Answer 2

You may use a regex. Here you can find a reference for the usage of regex in Scala, and here you can find some hints about how to create a proper regex for URLs.

Thus in your case you will have something like:

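// A Scala Regex cannot be applied to a Column directly, so pass the pattern string to Column.rlike
// (a triple-quoted string avoids having to double the backslashes)
val pattern = "PUT_YOUR_REGEX_HERE" // something like ^(https?|ftp)://www\.mydomain\.com?(/[^\s]*)? should work
val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))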

This solution requires a bit more work but is the safest one.
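Before plugging a pattern into the filter, it may help to try it on a few plain strings first. The sketch below uses only the Scala standard library. The pattern is a variation on the one suggested above, with the dots escaped and a leading ^ added so the domain has to appear at the start of the referrer; that anchoring, the object name ReferrerRegexCheck and the sample URLs are assumptions for illustration, not something the question specifies.

```scala
object ReferrerRegexCheck {
  // Illustrative pattern: scheme, then www.mydomain.co or www.mydomain.com,
  // then an optional path. The ^ anchors the match at the start of the string.
  val ownDomain = """^(https?|ftp)://www\.mydomain\.com?(/[^\s]*)?""".r

  // True when the referrer starts with the site's own domain.
  def isOwnReferrer(referrer: String): Boolean =
    ownDomain.findFirstIn(referrer).isDefined

  def main(args: Array[String]): Unit = {
    // Hypothetical sample referrers.
    val samples = Seq(
      "http://www.mydomain.com/home",
      "https://www.mydomain.co/page",
      "https://www.google.co.uk/search?q=http://www.mydomain.com"
    )
    samples.foreach(s => println(s"$s -> ${isOwnReferrer(s)}"))
  }
}
```

The first two samples print true and the google search URL prints false, because the anchored pattern no longer matches a domain that only appears in the query string. Once the pattern behaves as wanted, the same pattern string could be handed to Spark, for example through Column.rlike, which looks for a match anywhere in the value unless the pattern is anchored.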

2 answers

Answer 1: 18 (accepted), 2015-11-09 12:13:25

Answer 2: 0, 2015-11-09 12:22:13
