简体   繁体   中英

Extract a Path from a server log String using regular expressions in Scala

I have a few logs like below

endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688                                   
ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651       
pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713                        
204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970  
pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669                        
scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124

I need to extract the paths from the above logs. Here is the code that I tried

val pattern = "\\s+([^\\s]+)\\s+HTTP".r
val match = pattern.findFirstIn(log)

Here is the output that I got.

/images/ HTTP
/history/gemini/gemini-spacecraft.txt HTTP
/images/launch-logo.gif HTTP
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP
/images/WORLD-logosmall.gif HTTP
/history/mercury/mr-3/mr-3.html HTTP

How do I get rid of HTTP in the above paths?

You're match is in first capturing group,

Alternatively you can use positive lookahead

\\s+[^\\s]+(?=\\s+HTTP)

Demo

在此处输入图片说明

Your match is in the first capturing group () which you might shorten to:

\s(\S+)\s+HTTP 

In Scala

val pattern = "\\s(\\S+)\\s+HTTP".r

Regex demo

You might get the logs using findAllIn:

val pattern = "\\s(\\S+)\\s+HTTP".r
val strings = List(
  """endeavor.fujitsu.co.jp - - [10/Jul/1995:00:00:15 -0400] "GET /images/ HTTP/1.0" 200 17688                                   """,
  """ad13-022.compuserve.com - - [10/Jul/1995:00:00:15 -0400] "GET /history/gemini/gemini-spacecraft.txt HTTP/1.0" 200 651       """,
  """pm2-15.magicnet.net - - [10/Jul/1995:00:00:15 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713                        """,
  """204.239.199.40 - - [10/Jul/1995:00:00:16 -0400] "GET /shuttle/missions/sts-71/images/KSC-95EC-0613.gif HTTP/1.0" 200 45970  """,
  """pm1-4.tricon.net - - [10/Jul/1995:00:00:17 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 200 669                        """,
  """scorpio.digex.net - - [10/Jul/1995:00:00:19 -0400] "GET /history/mercury/mr-3/mr-3.html HTTP/1.0" 200 1124"""
)

strings.foreach { log =>
  val m = pattern.findAllIn(log).group(1)
  println(m)
}

Result

/images/
/history/gemini/gemini-spacecraft.txt
/images/launch-logo.gif
/shuttle/missions/sts-71/images/KSC-95EC-0613.gif
/images/WORLD-logosmall.gif
/history/mercury/mr-3/mr-3.html

Scala demo

To also match this line from the comment:

columbia.acc.brad.ac.uk - - [10/Jul/1995:00:52:36 -0400] "GET /ksc.html" 200 7067

You might use:

\S+ (/(?:[^/\s]+/)*[^\s"]+)

Regex demo

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM