
Extracting timestamp from string with regex in Spark RDD

I have a log like:

[Pipeline] timestamps
[Pipeline] {
[Pipeline] echo
20:33:05 0
[Pipeline] echo

I am trying to only extract the time information here (20:33:05).

I have tried to do the following:

val lines = sc.textFile("/logs/log7.txt")  
val individualLines=lines.flatMap(_.split("\n")) // Splitting file content into individual lines
val dates=individualLines.filter(value=>value.startsWith("[0-9]"))

I am getting the output as

MapPartitionsRDD[3] at filter at DateExtract.scala:30

How should the regex be defined here?

Any help would be much appreciated.

If your log file already has one entry per line, you do not have to split it; sc.textFile reads the file line by line, so each element of the RDD is already a String.

Then check whether each line starts with a digit using Character.isDigit, as below:

  val lines = sc.textFile("/logs/log7.txt")            // each element is one line of the file
  val dates = lines
    .filter(value => value.nonEmpty && Character.isDigit(value.charAt(0))) // skip empty lines, keep lines starting with a digit
    .map(_.split(" ")(0))                               // take the first token, i.e. the timestamp
  dates.foreach(println)                                // foreach is an action, so this triggers the computation

If you want to strictly match the timestamp with a regex and filter out anything that does not match, you can use:

val dates = lines
    .filter(value => value.nonEmpty && Character.isDigit(value.charAt(0)))
    .map(_.split(" ")(0))
    .filter(_.matches("""\d{2}:\d{2}:\d{2}"""))         // keep only tokens of the form HH:mm:ss

Output:

20:33:05
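As a side note, startsWith("[0-9]") in your attempt does a literal prefix comparison, not regex matching, which is why it filters everything out. If you want to define the regex itself and extract the match directly, a minimal sketch (assuming the same /logs/log7.txt input; timePattern is just an illustrative name) could be:

val timePattern = """\d{2}:\d{2}:\d{2}""".r             // HH:mm:ss pattern as a Scala Regex

val lines = sc.textFile("/logs/log7.txt")
// findFirstIn returns Option[String]; flatMap keeps only the lines that contain a match
val dates = lines.flatMap(line => timePattern.findFirstIn(line))
dates.foreach(println)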

Hope this helps!
