How to replace part of a string using regex pattern matching in Scala?

I have a String which contains column names and data types, as below:

val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"

My requirement is to convert the Postgres data types that appear as numeric or numeric(15,10) into Spark SQL compatible data types. In this case:

numeric         -> decimal(38,30)
numeric(15,10)  -> decimal(15,10)
numeric(20,0)   -> bigint   (This is an integral data type, since its scale is zero.)

To access the data types in the string cdt, I split it and created a Seq from it.

val dt = cdt.split("\\|").toSeq
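As a side note on the escaping: String.split interprets its argument as a regex, and an unescaped pipe is the alternation operator, so it would split between every character instead of on the pipe. The sketch below shows three equivalent ways to split on a literal pipe. SplitSketch and sample are hypothetical names used only for illustration.

```scala
import java.util.regex.Pattern

object SplitSketch {
  // Hypothetical sample input in the same "name:type|name:type" shape.
  val sample = "a:integer|b:numeric"

  // String.split takes a regex, so the pipe is escaped here ...
  val viaEscapedRegex: Seq[String] = sample.split("\\|").toSeq

  // ... or quoted as a literal with Pattern.quote ...
  val viaQuote: Seq[String] = sample.split(Pattern.quote("|")).toSeq

  // ... or passed as a Char, which Scala treats as a literal separator.
  val viaChar: Seq[String] = sample.split('|').toSeq
}
```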

Now I have a Seq in which each element is a String in the format below:

Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")

I have the pattern matching regex """numeric\(\d+,(\d+)\)""".r for numeric(precision, scale), which only works if there is a scale of two digits, e.g. numeric(20,23). I am very new to regex and Scala, and I don't understand how to create regex patterns for the remaining two cases and apply them to a string to match a condition. I tried it in the way below, but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn".

dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))

def changeDataType(colName: String, cd: String): String = {
  val finalColumns = new String
  val pattern1 = """numeric\(\d+,(\d+)\)""".r
  cd match {
    // Compile error here: findFirstMatchIn is a method, not an extractor
    case pattern1.findFirstMatchIn(dt) =>
  }
}
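
The compile error arises because findFirstMatchIn is an ordinary method that returns an Option[Regex.Match], whereas a case clause needs an extractor. A scala.util.matching.Regex value is itself an extractor, so it can appear directly in a case. The following is a minimal sketch of both usages, not a full solution. RegexMatchSketch, scaleOf and scaleViaFind are hypothetical names chosen for illustration.

```scala
import scala.util.matching.Regex

object RegexMatchSketch {
  // Triple-quoted string: single backslashes escape the parentheses.
  val withScale: Regex = """numeric\((\d+),(\d+)\)""".r

  // A Regex used as an extractor has to match the WHOLE input string;
  // each capture group is bound to one pattern variable.
  def scaleOf(dataType: String): Option[String] = dataType match {
    case withScale(_, scale) => Some(scale)
    case _                   => None
  }

  // findFirstMatchIn returns Option[Regex.Match], so match on the Option.
  // Unlike the extractor, it also finds a match inside a longer string.
  def scaleViaFind(text: String): Option[String] =
    withScale.findFirstMatchIn(text) match {
      case Some(m) => Some(m.group(2))
      case None    => None
    }
}
```

With these definitions, scaleOf("numeric(15,10)") yields Some("10"), while scaleOf("numeric") yields None because the parenthesised part is missing.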

I am trying to get the final output into a String as below:

header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint

How can I use multiple regex patterns for the different cases, to check the data type of each value in the Seq and change it to the suitable data type as described above?

Could anyone let me know how I can achieve this?

It can be done with a single regex pattern, but some testing of the match results is required.

val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r

cdt.split("\\|")
   .map{
     case numericRE(col,a,b) =>
       if (Option(b).isEmpty) s"$col:decimal(38,30)"
       else if (b == "0")     s"$col:bigint"
       else                   s"$col:decimal($a,$b)"
     case x => x  //pass-through
  }.mkString("|")

//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint

Of course it can be done with three different regex patterns, but I think this is pretty clear.
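
For comparison, the three-pattern alternative might look like the sketch below. It is one possible arrangement and not part of the code above. ThreePatternSketch, ZeroScale, WithScale, Bare and convert are hypothetical names.

```scala
object ThreePatternSketch {
  // One regex per case. A Regex used in a case clause must match the
  // whole element, so the bare pattern cannot swallow the other forms.
  val ZeroScale = raw"([^:]+):numeric\(\d+,0\)".r
  val WithScale = raw"([^:]+):numeric\((\d+),(\d+)\)".r
  val Bare      = raw"([^:]+):numeric".r

  def convert(cdt: String): String =
    cdt.split("\\|").map {
      // The scale-zero case goes first: WithScale would also match it.
      case ZeroScale(col)              => s"$col:bigint"
      case WithScale(col, prec, scale) => s"$col:decimal($prec,$scale)"
      case Bare(col)                   => s"$col:decimal(38,30)"
      case other                       => other // not numeric: pass through
    }.mkString("|")
}
```

Calling ThreePatternSketch.convert(cdt) on the string from the question should give the same result as the single-pattern version. The trade-off is that the case ordering now carries part of the logic.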


Explanation

  • raw - the raw interpolator means the backslashes don't have to be doubled (no \\)
  • ([^:]+) - capture everything up to the 1st colon
  • :numeric - followed by the string ":numeric"
  • (?: - start a non-capture group
  • \((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
  • )? - the non-capture group is optional
  • numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null
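
The null behaviour in the last point can be made explicit by wrapping the captures in Option, as in the following sketch. It reuses the numericRE pattern from the answer. OptionalGroupSketch and precisionAndScale are hypothetical names introduced only for this illustration.

```scala
object OptionalGroupSketch {
  // Same pattern as in the answer above.
  val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r

  // Returns (precision, scale) when both were captured, None otherwise.
  def precisionAndScale(field: String): Option[(String, String)] =
    field match {
      case numericRE(_, a, b) =>
        // Option(null) is None, so groups that did not take part in the
        // match drop out of the for-comprehension.
        for (p <- Option(a); s <- Option(b)) yield (p, s)
      case _ => None // not a numeric column at all
    }
}
```

Here precisionAndScale("x:numeric(15,10)") gives Some(("15","10")), while both "x:numeric" and "x:integer" give None, the first because the optional group did not match and the second because the pattern did not match at all.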
