Filter a pyspark.RDD with regex
I have a pyspark.RDD containing dates that I would like to filter out. The dates appear in this form within my RDD:
data.collect() = ["Nujabes","Hip Hop","04:45 16 October 2018"]
I have been trying to filter these out with a regex:
r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}"
but I am doing it the wrong way:
data = data.filter(lambda x: x != r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}")
For the data given above, the desired output would be:
data.collect() = ["Nujabes","Hip Hop"]
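The != comparison tests plain string equality against the pattern text, so it never applies the regex and no element is removed. You can filter with Python's re module instead: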
import re
data2 = data.filter(lambda x: not re.match(r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}", x))
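As a minimal sketch of the same idea, the predicate can be pulled out into a named function and checked on a plain Python list before handing it to the RDD. The names DATE_PATTERN and is_not_date are hypothetical, chosen here for illustration. This variant also uses re.fullmatch rather than re.match, on the assumption that only strings consisting entirely of a date should be dropped; re.match would also drop strings that merely start with one.

```python
import re

# Pattern taken from the question; compiled once so it is not rebuilt per element.
DATE_PATTERN = re.compile(r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}")


def is_not_date(value):
    """Return True when the whole string is NOT of the form 'HH:MM DD Month YYYY'."""
    return DATE_PATTERN.fullmatch(value) is None


# Plain-Python check of the predicate; no Spark session is needed for this part.
sample = ["Nujabes", "Hip Hop", "04:45 16 October 2018"]
kept = [s for s in sample if is_not_date(s)]
print(kept)

# With an RDD named `data`, as in the question, the same predicate would be passed as:
# data2 = data.filter(is_not_date)
```

Because the predicate is an ordinary top-level function, it can be passed to data.filter directly in place of the lambda.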