How to create a generic regular expression so that all group results can be extracted in Scala Spark

We have .txt log files, and I use Scala Spark to read them. Each file contains data records, one per line. I read the data as shown below:

val sc = spark.sparkContext
// textFile returns an RDD[String] with one element per line
val dataframe = sc.textFile("/path/to/log/*.txt")

The lines in the log files are mainly of three types. The first type looks like this:

ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8   LNJ05193B   dlMaxRetxThreshold  8
ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false DNJ05024D31 enableServiceSpecificHARQ   FALSE
ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=LNJ05024D31 primaryUpperLayerInd OFF    LNJ05024D31 primaryUpperLayerInd    OFF

The second type of line looks like this:

ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=BNJ05024D31,EUtranFreqRelation=5035 connectedModeMobilityPrio 7 LNJ05024D   5035    connectedModeMobilityPrio

And some raw lines look like this:

ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0  LNJ05147D11 2250    310260  cellIndividualOffsetEUtran  0

I am trying to produce a single CSV file that contains all of the above records.

In all line types the common part is the Mana= and ENF= prefix (that is, ManagedElement= and ENodeBFunction= in the samples above), so to capture it I used a regular expression like:

val regx_first_exp = """Manag=(\w*).*ENF=(\w),.*""".r

The last two words are the key and the value, which can be extracted as below:

val last_two = """(\w+)=(\w+[^=])""".r

In between, I try to extract the value after each equals sign (=) and put each one into its own column; if there is no match, I simply want to put a null value in that column.
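The idea described above (take whatever follows each = sign and leave a column empty when the key is absent) can be sketched in plain Scala, without Spark. This is only an illustration: the names keyValuePair, pathAttributes and sampleLine are hypothetical, and the sketch assumes that values never contain commas or whitespace.

```scala
import scala.util.matching.Regex

// Hypothetical helper names; only the sample line comes from the question.
val keyValuePair: Regex = """(\w+)=([^,\s]+)""".r

// Collect every key=value pair of the comma-separated path into a Map.
def pathAttributes(line: String): Map[String, String] =
  keyValuePair.findAllMatchIn(line).map(m => m.group(1) -> m.group(2)).toMap

val sampleLine =
  "ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=BNJ05024D31,EUtranFreqRelation=5035 connectedModeMobilityPrio 7"

val attrs = pathAttributes(sampleLine)

// Keys that are absent come back as None, which corresponds to a null column.
val cellFDD = attrs.get("EUtranCellFDD")
val targetCell = attrs.get("EUtranCellRelation")

// The parameter name and value are the 2nd and 3rd whitespace-separated tokens.
val tokens = sampleLine.trim.split("\\s+")
val paramName = tokens(1)
val paramValue = tokens(2)
```

Because the lookup is by key rather than by position, the same function handles every line type, including the raw lines that carry extra trailing columns.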

The final output should look like this:

+--------------+-----------+---------------+--------------+--------------------------+----------+
|managedElement|cellFDD    |targetFrequency|targetCell    |paramName                 |paramValue|
+--------------+-----------+---------------+--------------+--------------------------+----------+
|LNJ05025D     |DNJ05025D31|AWS_2087       |null          |threshXHighQ              |0         |
|LNJ05024D     |BNJ05024D31|5035           |null          |connectedModeMobilityPrio |7         |
|LNJ05193B     |null       |null           |null          |dlMaxRetxThreshold        |8         |
|LNJ05024D     |DNJ05024D31|null           |null          |enableServiceSpecificHARQ |false     |
|LNJ05024D     |LNJ05024D31|null           |null          |primaryUpperLayerInd      |OFF       |
|LNJ05147D     |LNJ05147D11|2250           |310260-51992-1|cellIndividualOffsetEUtran|0         |
+--------------+-----------+---------------+--------------+--------------------------+----------+

Is this possible with a single regex, or with multiple UDFs, using as few filters as possible?

I am new to Scala, so any suggestions would be appreciated. The last column in the image only shows the type of each row, as described above.

Here is one solution that uses four different regular expressions together with Scala's pattern matching on regular expressions:

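// Needed for toDF and for the encoder used by map; assumes a SparkSession named spark
import spark.implicits._

val df = Seq(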
 ("ManagedElement=LNJ05025D,ENodeBFunction=1,EUtranCellFDD=DNJ05025D31,UtranFreqRelation=AWS_2087 threshXHighQ 0"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=BNJ05024D31,EUtranFreqRelation=5035 connectedModeMobilityPrio 7"),
 ("ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false"),
 ("ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=LNJ05024D31 primaryUpperLayerInd OFF"),
 ("ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0")
).toDF("logs")

case class LogItem(managedElement: String, cellFDD: String, targetFrequency: String, targetCell: String, paramName: String, paramValue: String)

// 1st type: ManagedElement=LNJ05025D,ENodeBFunction=1,EUtranCellFDD=DNJ05025D31,UtranFreqRelation=AWS_2087 threshXHighQ 0
// extract 5 groups
val log1RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+).*tranFreqRelation=(\w+)\s(\w+)\s(\w+)$""".r

// 2nd type: ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8
// extract 3 groups
val log2RegExpr = """^ManagedElement=(\w+).*\s(\w+)\s(\w+)$""".r

// 3rd type: ManagedElement=LNJ05024D,ENodeBFunction=1,EUtranCellFDD=DNJ05024D31 enableServiceSpecificHARQ false
// extract 4 groups
val log3RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+)\s(\w+)\s(\w+)$""".r

// 4th type: ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0
// extract 6 groups
val log4RegExpr = """^ManagedElement=(\w+).*EUtranCellFDD=(\w+).*tranFreqRelation=(\w+).*EUtranCellRelation=(\S+)\s(\w+)\s(\w+)$""".r

df.map{row =>
  row.getString(0) match {
    case log4RegExpr(me, cf, tf, tc, pn, pv) => LogItem(me, cf, tf, tc, pn, pv)
    case log1RegExpr(me, cf, tf, pn, pv) => LogItem(me, cf, tf, null, pn, pv)
    case log3RegExpr(me, cf, pn, pv) => LogItem(me, cf, null, null, pn, pv)
    case log2RegExpr(me, pn, pv) => LogItem(me, null, null, null, pn, pv)
    case _ => throw new Exception("Invalid format")
  }
}.show(false)

And the output:

+--------------+-----------+---------------+--------------+--------------------------+----------+
|managedElement|cellFDD    |targetFrequency|targetCell    |paramName                 |paramValue|
+--------------+-----------+---------------+--------------+--------------------------+----------+
|LNJ05025D     |DNJ05025D31|AWS_2087       |null          |threshXHighQ              |0         |
|LNJ05024D     |BNJ05024D31|5035           |null          |connectedModeMobilityPrio |7         |
|LNJ05193B     |null       |null           |null          |dlMaxRetxThreshold        |8         |
|LNJ05024D     |DNJ05024D31|null           |null          |enableServiceSpecificHARQ |false     |
|LNJ05024D     |LNJ05024D31|null           |null          |primaryUpperLayerInd      |OFF       |
|LNJ05147D     |LNJ05147D11|2250           |310260-51992-1|cellIndividualOffsetEUtran|0         |
+--------------+-----------+---------------+--------------+--------------------------+----------+

As you can see, we return an instance of the case class LogItem after matching one of the given expressions.
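The question also asks whether a single regex could do the job. The following is a minimal sketch of that alternative, not part of the solution above: it wraps each optional path segment in an optional non-capturing group, so a group that does not participate in the match comes back as null and is turned into None. The names singleLogRegex, ParsedLog and parseLog are hypothetical, and the pattern assumes the segments always appear in the order seen in the samples.

```scala
import scala.util.matching.Regex

// Hypothetical single pattern covering all four line types.
val singleLogRegex: Regex = (
  """^ManagedElement=(\w+)""" +
  // Skip any other path segment that is not one of the three optional ones.
  """(?:,(?!EUtranCellFDD=|E?UtranFreqRelation=|EUtranCellRelation=)[^,\s]+)*""" +
  """(?:,EUtranCellFDD=(\w+))?""" +
  """(?:,E?UtranFreqRelation=(\w+))?""" +
  """(?:,EUtranCellRelation=(\S+))?""" +
  """\s+(\w+)\s+(\w+)$"""
).r

final case class ParsedLog(
  managedElement: String,
  cellFDD: Option[String],
  targetFrequency: Option[String],
  targetCell: Option[String],
  paramName: String,
  paramValue: String
)

// Returns None instead of throwing when the line has an unexpected format.
def parseLog(line: String): Option[ParsedLog] = line match {
  case singleLogRegex(me, cf, tf, tc, pn, pv) =>
    // Option(null) is None, so unmatched optional groups become empty columns.
    Some(ParsedLog(me, Option(cf), Option(tf), Option(tc), pn, pv))
  case _ => None
}

val bearerLine =
  "ManagedElement=LNJ05193B,ENodeBFunction=1,RadioBearerTable=default,DataRadioBearer=1 dlMaxRetxThreshold 8"
val relationLine =
  "ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0"

val parsedBearer = parseLog(bearerLine)
val parsedRelation = parseLog(relationLine)
```

With a single pattern the order of match cases no longer matters, at the cost of a regex that is harder to read than the four separate ones.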

Two things to notice here:

  1. Keep the order of the match cases as specified above, from the pattern with the most capture groups to the one with the fewest. Otherwise a log4 line can fall under the log2 category, since log2RegExpr still matches it.

  2. From your examples it seems that EUtranCellRelation contains special characters (such as -), so \S+ (one or more non-whitespace characters) is required instead of \w+.
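Both points can be checked in plain Scala without Spark. The sketch below reuses the log2 and log4 patterns from the solution under shorter, hypothetical names (log2, log4, relationLogLine) and shows that the most general pattern wins when it is tried first, and that \w+ cannot capture a value containing hyphens.

```scala
import scala.util.matching.Regex

// Same patterns as log2RegExpr and log4RegExpr, renamed for this sketch.
val log2: Regex = """^ManagedElement=(\w+).*\s(\w+)\s(\w+)$""".r
val log4: Regex =
  """^ManagedElement=(\w+).*EUtranCellFDD=(\w+).*tranFreqRelation=(\w+).*EUtranCellRelation=(\S+)\s(\w+)\s(\w+)$""".r

val relationLogLine =
  "ManagedElement=LNJ05147D,ENodeBFunction=1,EUtranCellFDD=LNJ05147D11,EUtranFreqRelation=2250,EUtranCellRelation=310260-51992-1 cellIndividualOffsetEUtran 0"

// Point 1: the general pattern is tried first and swallows the log4 line.
val wrongOrder = relationLogLine match {
  case log2(_*) => "log2"
  case log4(_*) => "log4"
  case _        => "none"
}

// The most specific pattern first gives the intended classification.
val rightOrder = relationLogLine match {
  case log4(_*) => "log4"
  case log2(_*) => "log2"
  case _        => "none"
}

// Point 2: \w+ stops at the first hyphen, so no whitespace follows and nothing matches.
val withWord = """EUtranCellRelation=(\w+)\s""".r.findFirstMatchIn(relationLogLine)
// \S+ runs up to the next whitespace and captures the full value.
val withNonSpace = """EUtranCellRelation=(\S+)\s""".r.findFirstMatchIn(relationLogLine).map(_.group(1))
```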
