how to create a generic regular expression so that all group result can be extract in scala spark

Question

We have .txt log file , i used scala spark to read the file. the file contains sets of data in row wise . i read the data one by one like as below

val sc = spark.SparkContext
val dataframe = sc.textFile(/path/to/log/*.txt)

Mainly the data in all logs file is three type like one of them as below

ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8   LNJ05193B   dlMaxRetxThreshold  8
ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false DNJ05024D31 enableServiceSpecificHARQ   FALSE
ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=LNJ05024D31 primaryUpperLayerInd OFF    LNJ05024D31 primaryUpperLayerInd    OFF

and second type of line are this type

ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=BNJ05024D31,EUtranFreqRelation=5035 connectedModeMobilityPrio 7 LNJ05024D   5035    connectedModeMobilityPrio

and some raw line are as below:

ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0  LNJ05147D11 2250    310260  cellIndividualOffsetEUtran  0

I try to make a common csv file that contain all of the above record like as below

In all type of line the common part is Mana= and ENF= so to get this is used regular expression like

val regx_first_exp = """"Manag=(\w*).*ENF=(\w),.*""".r

The last two words are the key value can be extract like as below

val last_two = """(\w+)=(\w+[^=])"""".r

and in between i try to extract the value after eqal to( =sign) in different and want to put in different columns if there is no match than simple put null value in the particular columns.

The final out like :

+--------------+-----------+---------------+--------------+--------------------------+----------+
|managedElement|cellFDD    |targetFrequency|targetCell    |paramName                 |paramValue|
+--------------+-----------+---------------+--------------+--------------------------+----------+
|LNJ05025D     |DNJ05025D31|AWS_2087       |null          |threshXHighQ              |0         |
|LNJ05024D     |BNJ05024D31|5035           |null          |connectedModeMobilityPrio |7         |
|LNJ05193B     |null       |null           |null          |dlMaxRetxThreshold        |8         |
|LNJ05024D     |DNJ05024D31|null           |null          |enableServiceSpecificHARQ |false     |
|LNJ05024D     |LNJ05024D31|null           |null          |primaryUpperLayerInd      |OFF       |
|LNJ05147D     |LNJ05147D11|2250           |310260-51992-1|cellIndividualOffsetEUtran|0         |
+--------------+-----------+---------------+--------------+--------------------------+----------+

Is this can we possible in single regex or multiple udf function as much as minimum filter?

I am new in scala, please provide the suggestion for the same. the last column in the image is just for type of rows as mentioned one by one.

Answer 1

Here is one solution that works with 4 different regex expressions using pattern matching with regular expressions as explained here :

val df = Seq(
 ("ManagedElement=LNJ05025D,ENodeBFunction=1,EUtranCellFDD=DNJ05025D31,UtranFreqRelation=AWS_2087 threshXHighQ 0"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=BNJ05024D31,EUtranFreqRelation=5035 connectedModeMobilityPrio 7"),
 ("ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=LNJ05024D31 primaryUpperLayerInd OFF"),
 ("ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0")
).toDF("logs")

case class LogItem(managedElement: String, cellFDD: String, targetFrequency: String, targetCell: String, paramName: String, paramValue: String)

// 1st type: ManagedElement=LNJ05025D,ENodeBFunction=1,EUtranCellFDD=DNJ05025D31,UtranFreqRelation=AWS_2087 threshXHighQ 0
// extract 5 groups
val log1RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+).*tranFreqRelation=(\w+)\s(\w+)\s(\w+)$""".r

// 2nd type: ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8
// extract 3 groups
val log2RegExpr = """^ManagedElement=(\w+).*\s(\w+)\s(\w+)$""".r

// 3rd type: ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false
// extract 4 groups
val log3RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+)\s(\w+)\s(\w+)$""".r

// 4th type: ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0
// extract 6 groups
val log4RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+).*tranFreqRelation=(\w+).*EUtranCellRelation=(\S+)\s(\w+)\s(\w+)$""".r

df.map{row =>
  row.getString(0) match {
    case log4RegExpr(me, cf, tf, tc, pn, pv) => LogItem(me, cf, tf, tc, pn, pv)
    case log1RegExpr(me, cf, tf, pn, pv) => LogItem(me, cf, tf, null, pn, pv)
    case log3RegExpr(me, cf, pn, pv) => LogItem(me, cf, null, null, pn, pv)
    case log2RegExpr(me, pn, pv) => LogItem(me, null, null, null, pn, pv)
    case _ => throw new Exception("Invalid format")
  }
}.show(false)

And the output:

+--------------+-----------+---------------+--------------+--------------------------+----------+
|managedElement|cellFDD    |targetFrequency|targetCell    |paramName                 |paramValue|
+--------------+-----------+---------------+--------------+--------------------------+----------+
|LNJ05025D     |DNJ05025D31|AWS_2087       |null          |threshXHighQ              |0         |
|LNJ05024D     |BNJ05024D31|5035           |null          |connectedModeMobilityPrio |7         |
|LNJ05193B     |null       |null           |null          |dlMaxRetxThreshold        |8         |
|LNJ05024D     |DNJ05024D31|null           |null          |enableServiceSpecificHARQ |false     |
|LNJ05024D     |LNJ05024D31|null           |null          |primaryUpperLayerInd      |OFF       |
|LNJ05147D     |LNJ05147D11|2250           |310260-51992-1|cellIndividualOffsetEUtran|0         |
+--------------+-----------+---------------+--------------+--------------------------+----------+

As you can see we return an instance of the case class LogItem after matching one of the given expressions.

Two things to notice here:

You should be cautious to keep the order of matching cases as specified above, from the larger (more matches to extract) to the smaller (less matches) otherwise a log4 can fall under the category log2 since there is still a match!
From your examples it seems that EUtranCellRelation contains special characters therefore \\S+ (non space char) is required instead of \\w .

how to create a generic regular expression so that all group result can be extract in scala spark

Question

1 answers

solution1
1 ACCPTED 2020-02-01 19:22:24

how to create a generic regular expression so that all group result can be extract in scala spark

Question

1 answers

solution1 1 ACCPTED 2020-02-01 19:22:24

solution1
1 ACCPTED 2020-02-01 19:22:24