
Regex on Spark RDD[String] with Regex on multiline

I am trying to parse log files in Spark 1.6 using Scala. Here is some sample data:

2017-02-04 04:48:11,123 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:48:20,892 INFO [org.jasig.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - <Audit trail record BEGIN
=============================================================
WHO: audit:unknown
WHAT: TGT-7d937-yRqp6ObM7JOtkUZ7Ff4yEo95-casino1.example.org
ACTION: TICKET_GRANTING_TICKET_DESTROYED
APPLICATION: CASINO
WHEN: Sat Feb 04 04:48:20 AEDT 2017
CLIENT IP ADDRESS: 160.50.201.557
SERVER IP ADDRESS: login.cfu.asg
=============================================================

>
2017-02-04 04:48:32,165 INFO [org.jasig.cas.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:48:32,167 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:48:38,889 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 1 triggers>
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:48:52,790 DEBUG [org.quartz.core.JobRunShell] - <Calling execute on job DEFAULT.serviceRegistryReloaderJobDetail>
2017-02-04 04:48:52,790 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - <Adding registered service ^(https?|imaps?)://.*>
2017-02-04 04:48:52,792 DEBUG [org.jasig.casino.services.DefaultServicesManagerImpl] - <Adding registered service
2017-02-04 04:48:52,792 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:49:14,365 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Reloading registered services.>
2017-02-04 04:49:14,366 INFO [org.jasig.casino.services.DefaultServicesManagerImpl] - <Loaded 2 services.>
2017-02-04 04:49:19,699 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:49:43,465 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - <JaasAuthenticationHandler successfully authenticated >
2017-02-04 04:50:00,978 INFO [org.jasig.casino.authentication.PolicyBasedAuthenticationManager] - <Authenticated 3785973 with credentials.>
2017-02-04 04:50:00,978 INFO [org.jasig.inspektr.nhgij.support.Slf4jLogggbhAuditTrailManaver] - <Audit trail record BEGIN
=============================================================
WHO: z3705z73
WHAT: supplied credentials: [d37c5973]
ACTION: AUTHENTICATION_SUCCESS
APPLICATION: casinoINO
WHEN: Sat Feb 04 04:50:00 AEDT 2017
CLIENT IP ADDRESS: 101.181.28.555
SERVER IP ADDRESS: login.cfu.asg
=============================================================

>

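The data continues like this. There may be other log data between these patterns, but it is irrelevant to my parsing. I have about 40 GB of files, and each file contains one day of data.

All of these files are gzip-compressed. I tried sc.wholeTextFiles to get a pair RDD, but because each file is between 400 MB and 800 MB uncompressed, I ran into Java heap space errors.

So I switched to sc.textFile and tried reading one file at a time. I can create an RDD[String], and fortunately sc.textFile does not hit any heap space problems when I run operations on this RDD.

Here is the code I tried.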

val casinop2 = sc.textFile("/logdata/casino/catalina.out-20150228.gz")

val casop = casinop2.flatMap(x => x.split("\\n")).filter(x => !(
  x.contains("Reloading registered services") ||
  x.contains("Loaded 2 services.") ||
  x.contains("DEBUG") ||
  x.contains("ERROR") ||
  x.contains("java.lang.RuntimeException") ||
  x.contains("Caused by:") ||
  x.contains("Granted ticket") ||
  x.contains("java.lang.IllegalStateException") ||
  x.startsWith("\\t") ||
  x.contains("org.jasig.cas.authentication.PolicyBasedAuthenticationManager")
))

val pattern = new Regex("""((\\d{4})-(\\d{2})-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}),\\d{3}\\s+(\\w+)\\s+\\[(.*)\\]\\s+\\-\\s+\\<.*\\s\\=*\\s+([W][H][O]\\:)\\s+(.*)\\s+([W][H][A][T]\\:)\\s+(.*)\\s+([A][C][T][I][O][N]\\:)\\s+(.*)\\s+([A][P][P][L][I][C][A][T][I][O][N]\\:)\\s+(.*)\\s+([W][H][E][N]\\:)\\s+(.*)\\s+([AZ\\s]{17}\\:)\\s+(.*)\\s+([AZ\\s]{17}\\:)\\s+(.*)\\s+\\=*\\s\\s\\>""")

pattern: scala.util.matching.Regex = ((\\d{4})-(\\d{2})-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}),\\d{3}\\s+(\\w+)\\s+\\[(.*)\\]\\s+\\-\\s+\\<.*\\s\\=*\\s+([W][H][O]\\:)\\s+(.*)\\s+([W][H][A][T]\\:)\\s+(.*)\\s+([A][C][T][I][O][N]\\:)\\s+(.*)\\s+([A][P][P][L][I][C][A][T][I][O][N]\\:)\\s+(.*)\\s+([W][H][E][N]\\:)\\s+(.*)\\s+([AZ\\s]{17}\\:)\\s+(.*)\\s+([AZ\\s]{17}\\:)\\s+(.*)\\s+\\=*\\s\\s\\>

case class MLog(datetime: String, message: String, process: String, who: String, what: String, action: String, application: String, when: String, clientipaddress: String, serveripaddress: String,year: String, month: String)

pattern.findAllMatchIn(casop.collect.toString).toList

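The last statement throws a heap space error. The reason I want to convert the RDD into a single string variable is that the regex needs multi-line input, not a single line. For single-line input I would simply use map, flatMap, and so on.

The output I expect to get from the log file is: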

|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|     s4542732|supplied credenti...|AUTHENTICATION_SU...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|     s4542732|TGT-78959-EX63Wf2...|TICKET_GRANTING_T...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:41|   INFO|org.jasig.inspekt...|      4542732|ST-474481-jTxCJFB...|SERVICE_TICKET_CR...|        CAS|Sat Feb 04 04:54:...|  175.163.28.77|login.vu.edu.au|2017|   02|
|2017-02-04 04:54:44|   INFO|org.jasig.inspekt...|audit:unknown|ST-474481-jTxCJFB...|SERVICE_TICKET_VA...|        CAS|Sat Feb 04 04:54:...|  203.13.194.68|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|     s3785573|supplied credenti...|AUTHENTICATION_SU...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|     s3785573|TGT-78960-yWaWkcN...|TICKET_GRANTING_T...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|      3785573|ST-474482-rARxdUG...|SERVICE_TICKET_CR...|        CAS|Sat Feb 04 04:55:...| 101.181.28.125|login.vu.edu.au|2017|   02|
|2017-02-04 04:55:02|   INFO|org.jasig.inspekt...|audit:unknown|ST-474482-rARxdUG...|SERVICE_TICKET_VA...|        CAS|Sat Feb 04 04:55:...|  203.13.194.68|login.vu.edu.au|2017|   02|
+-------------------+-------+--------------------+-------------+--------------------+--------------------+-----------+--------------------+---------------+---------------+----+-----+

How can we read multi-line input and feed it to the regex?

I have fixed and improved your regex. It should now work on your multi-line log entries.

The regex is the following beast:

(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+<[^>]*\s\=*\s+WHO\:\s+([^>\n]*)\s+WHAT\:\s+([^>\n]*)\s+ACTION\:\s+([^>\n]*)\s+APPLICATION\:\s+([^>\n]*)\s+WHEN\:\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+\=*\s\s>

I tried it on your logs with the following replacement pattern. You should adapt it to your actual needs:

\1 | \2 | \3 | WHO:\4 | WHAT: \5 | ACTION: \6 | APPLICATION: \7 | WHEN: \8 | \9  $10 | $11  $12

The result is as follows: [images: before, after]
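To make the regex concrete in Scala, here is a minimal sketch that applies it to one audit record taken from the sample data and maps the capture groups onto the MLog case class from the question. The object name AuditRegexDemo and the helper parseAudit are hypothetical names introduced only for this illustration, and deriving year and month from the timestamp is an assumption about how those two fields should be filled.

```scala
import scala.util.matching.Regex

// Mirrors the case class from the question.
case class MLog(datetime: String, message: String, process: String, who: String,
                what: String, action: String, application: String, when: String,
                clientipaddress: String, serveripaddress: String,
                year: String, month: String)

// Hypothetical wrapper object, used only for this illustration.
object AuditRegexDemo {
  // The regex from the answer, unchanged. Triple quotes keep backslashes literal.
  val auditPattern: Regex =
    """(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+\[(.*)\]\s+\-\s+<[^>]*\s\=*\s+WHO\:\s+([^>\n]*)\s+WHAT\:\s+([^>\n]*)\s+ACTION\:\s+([^>\n]*)\s+APPLICATION\:\s+([^>\n]*)\s+WHEN\:\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+([A-Z\s]{17}\:)\s+([^>\n]*)\s+\=*\s\s>""".r

  // Hypothetical helper: returns one MLog per audit record found in the text.
  // Groups 9 and 11 hold the "CLIENT IP ADDRESS:" / "SERVER IP ADDRESS:" labels,
  // so the values are taken from groups 10 and 12.
  def parseAudit(text: String): List[MLog] =
    auditPattern.findAllMatchIn(text).map { m =>
      val datetime = m.group(1)
      MLog(datetime, m.group(2), m.group(3), m.group(4), m.group(5), m.group(6),
        m.group(7), m.group(8), m.group(10), m.group(12),
        datetime.substring(0, 4),  // assumption: year comes from the timestamp
        datetime.substring(5, 7))  // assumption: month comes from the timestamp
    }.toList

  // One single-line entry followed by one audit record, copied from the sample data.
  val sample: String =
    """2017-02-04 04:48:11,123 DEBUG [org.quartz.core.QuartzSchedulerThread] - <batch acquisition of 0 triggers>
      |2017-02-04 04:48:20,892 INFO [org.jasig.inspektr.audit.support.Slf4jLoggingAuditTrailManager] - <Audit trail record BEGIN
      |=============================================================
      |WHO: audit:unknown
      |WHAT: TGT-7d937-yRqp6ObM7JOtkUZ7Ff4yEo95-casino1.example.org
      |ACTION: TICKET_GRANTING_TICKET_DESTROYED
      |APPLICATION: CASINO
      |WHEN: Sat Feb 04 04:48:20 AEDT 2017
      |CLIENT IP ADDRESS: 160.50.201.557
      |SERVER IP ADDRESS: login.cfu.asg
      |=============================================================
      |
      |>""".stripMargin
}
```

Calling AuditRegexDemo.parseAudit(AuditRegexDemo.sample) should yield a single MLog for the audit record, while the single-line DEBUG entry is skipped because it has no WHO/WHAT block. Note that Scala's replaceAllIn uses $1-style group references, so the \1-style replacement pattern above would need adapting if used from Scala.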

Last but not least, you may have to increase the heap size, for example with --executor-memory 10g.
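As a possible alternative to collecting the whole RDD into one string, the lines can be regrouped into multi-line records inside each partition, so the regex only ever sees one record at a time. The sketch below is not part of the original answer. It assumes that every log entry starts with a timestamp such as 2017-02-04 04:48:20,892 and that continuation lines never do. RecordGrouping and groupRecords are hypothetical names. Because gzip files are not splittable, sc.textFile reads each file as a single partition, so a record should not be cut at a partition boundary.

```scala
// Hypothetical helper object, used only for this illustration.
object RecordGrouping {
  // Assumption: a new log entry always begins with "yyyy-MM-dd HH:mm:ss,SSS ".
  private val recordStart = """\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} """.r

  // Lazily merges continuation lines into the entry that precedes them.
  // Only one record is held in memory at a time.
  def groupRecords(lines: Iterator[String]): Iterator[String] = new Iterator[String] {
    private val buffered = lines.buffered

    def hasNext: Boolean = buffered.hasNext

    def next(): String = {
      val record = new StringBuilder(buffered.next())
      // Keep appending until the next line starts a new timestamped entry.
      while (buffered.hasNext && recordStart.findPrefixOf(buffered.head).isEmpty) {
        record.append('\n').append(buffered.next())
      }
      record.toString
    }
  }
}

// Usage in a Spark shell (needs a SparkContext, so it is shown as a comment):
// val records = sc.textFile("/logdata/casino/catalina.out-20150228.gz")
//   .mapPartitions(RecordGrouping.groupRecords)
// The multi-line regex can then be applied to each record with map or flatMap.
```

With this approach nothing is collected to the driver, which may reduce the need to raise the heap size, although that depends on the cluster and the data.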
