I have a pyspark.RDD containing dates that I would like to filter out. The dates appear in this form within my RDD:
data.collect() = ["Nujabes","Hip Hop","04:45 16 October 2018"]
I have been trying to filter these out through a regex using:
r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}"
but I am doing it the wrong way:
data = data.filter(lambda x: x != r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}")
For the given data above, the desired output would be:
data.collect() = ["Nujabes","Hip Hop"]
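You can filter with Python's re module. Your filter compares each element to the pattern string itself with !=, which is true for every element, so nothing is removed. Match each element against the regex instead:
import re
data2 = data.filter(lambda x: not re.match(r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}", x))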
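The same predicate can be checked without a Spark cluster. The sketch below is only an illustration: it applies the regex from the question to a plain Python list standing in for the RDD. The names DATE_RE, is_not_date, and rows are made up for this example, and it uses re.fullmatch rather than re.match on the assumption that only strings consisting entirely of a date should be dropped.

```python
import re

# Compile once so the pattern is not re-parsed for every element.
# DATE_RE and is_not_date are illustrative names, not part of the question.
DATE_RE = re.compile(r"[0-9]{2}:[0-9]{2} [0-9]{2} [A-Z][a-z]+ [0-9]{4}")

def is_not_date(value):
    """Return True when value is not entirely a '04:45 16 October 2018' style date."""
    return DATE_RE.fullmatch(value) is None

# A plain list stands in for the RDD contents shown in the question.
rows = ["Nujabes", "Hip Hop", "04:45 16 October 2018"]
kept = [row for row in rows if is_not_date(row)]
print(kept)
```

With an actual RDD, the same function should be usable as data.filter(is_not_date). Note that re.match only anchors at the start of the string, so it would also drop an element that merely begins with a date; re.fullmatch requires the whole element to be a date.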