
Extracting timestamp from string with regex in Spark RDD

I have a log like:

[Pipeline] timestamps
[Pipeline] {
[Pipeline] echo
20:33:05 0
[Pipeline] echo

I am trying to only extract the time information here (20:33:05).

I have tried to do the following:

val lines = sc.textFile("/logs/log7.txt")  
val individualLines=lines.flatMap(_.split("\n")) // Splitting file content into individual lines
val dates=individualLines.filter(value=>value.startsWith("[0-9]"))

I am getting the output as

MapPartitionsRDD[3] at filter at DateExtract.scala:30

How should the regex be defined here?

Any help would be much appreciated.

If your log file already has one entry per line, you do not have to split it; sc.textFile reads the file line by line, so each element of the RDD is already a String.

Then check whether each line starts with a digit using Character.isDigit, as below:

  val lines = sc.textFile("/logs/log7.txt")            // each element is one line of the file
  val dates = lines
    .filter(value => value.nonEmpty && Character.isDigit(value.charAt(0))) // skip empty lines, keep lines starting with a digit
    .map(_.split(" ")(0))                               // take the first token, i.e. the timestamp
  dates.foreach(println)                                // foreach is an action, so this triggers the computation

If you want to strictly match the timestamp with a regex and filter out anything that does not match, you can use:

val dates = lines
    .filter(value => value.nonEmpty && Character.isDigit(value.charAt(0)))
    .map(_.split(" ")(0))
    .filter(_.matches("""\d{2}:\d{2}:\d{2}"""))         // keep only tokens of the form HH:mm:ss

Output:

20:33:05
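As a side note, startsWith("[0-9]") in your attempt does a literal prefix comparison, not regex matching, which is why it filters everything out. If you want to define the regex itself and extract the match directly, a minimal sketch (assuming the same /logs/log7.txt input; timePattern is just an illustrative name) could be:

val timePattern = """\d{2}:\d{2}:\d{2}""".r             // HH:mm:ss pattern as a Scala Regex

val lines = sc.textFile("/logs/log7.txt")
// findFirstIn returns Option[String]; flatMap keeps only the lines that contain a match
val dates = lines.flatMap(line => timePattern.findFirstIn(line))
dates.foreach(println)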

Hope this helps!
