Filter Spark DataFrame on string contains
I am using Spark 1.3.0 and Spark Avro 1.0.0. I am working from the example on the repository page. The following code works well:
val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")
But what if I needed to see if the doctor string contains a substring? Since we are writing our expression inside a string, how do I express a "contains"?
You can use contains (this works with an arbitrary sequence):
df.filter($"foo".contains("bar"))
like (SQL like with a simple SQL pattern, where _ matches an arbitrary character and % matches an arbitrary sequence):
df.filter($"foo".like("bar"))
or rlike (like with Java regular expressions):
df.filter($"foo".rlike("bar"))
depending on your requirements.
LIKE and RLIKE should work with SQL expressions as well, for example df.filter("foo LIKE '%bar%'") or df.filter("foo RLIKE 'bar'").
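The difference between the three options can be sketched on plain Scala strings, without Spark. This is only an emulation of the matching rules, not Spark's own code: the object `MatchSemantics` and its three helpers are hypothetical names, and escape characters in LIKE patterns are ignored for brevity.

```scala
import java.util.regex.Pattern

object MatchSemantics {
  // contains: plain substring test, no pattern syntax at all.
  def containsMatch(value: String, sub: String): Boolean =
    value.contains(sub)

  // like: SQL-style pattern where `_` is exactly one character and `%` is
  // any sequence; the pattern has to cover the whole value.
  // Simplification (assumption): escape characters are not handled.
  def likeMatch(value: String, pattern: String): Boolean = {
    val regex = pattern.toList.map {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }.mkString
    Pattern.compile(regex, Pattern.DOTALL).matcher(value).matches()
  }

  // rlike: Java regular expression that may match anywhere in the value.
  def rlikeMatch(value: String, regex: String): Boolean =
    Pattern.compile(regex).matcher(value).find()

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values standing in for a string column.
    val values = Seq("bar", "foobar", "barn", "baz")
    println(values.filter(containsMatch(_, "bar")))
    println(values.filter(likeMatch(_, "bar")))
    println(values.filter(likeMatch(_, "%bar%")))
    println(values.filter(rlikeMatch(_, "^bar")))
  }
}
```

Under these rules a like pattern without wildcards, such as "bar", only matches the exact value "bar", so a substring test needs "%bar%". An unanchored rlike pattern already matches anywhere in the value.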
In PySpark, the Spark SQL syntax:
where column_n like 'xyz%'
might not work. Use:
where column_n RLIKE '^xyz'
This works perfectly fine.