[英]How do I match multiple regex patterns against a column value in Spark?
我有专栏:
val originalSqlLikePatternMap = Map("item (%) is blacklisted%" -> "BLACK_LIST",
"%Testing%" -> "TESTING",
"%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)
val df = Seq(
"Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
"Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
"item (1) is blacklisted, #!@"
).toDF("raw_type")
val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
但它给出:
+---------------------------------------------------------+----------------------+
|raw_type |updatedType |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT|
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+----------------------+
但是我希望“ Testing(2,4,(4,6,7)foo,Foo购买计数1太低”才能给出2个值“ TESTING,TOO_LOW_PURCHASE_COUNT”,如下所示:
+---------------------------------------------------------+--------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST, Unkown |
+---------------------------------------------------------+--------------------------------+
有人可以告诉我我在做什么错吗?
好。 所以,这里有几件事,
关于find
,您需要对照每个正则表达式检查每个Row
以获取所需的输出,因此find不是正确的选择。
迭代器产生的满足谓词(如果有)的第一个值。
请注意正则表达式,低位后留了一个空格,这就是为什么它不匹配的原因。 也许您也应该重新考虑将%
替换为.*
,
%purchase count % is too low %
因此,随着更改,您的代码将类似于
val originalSqlLikePatternMap = Map(
"item (%) is blacklisted%" -> "BLACK_LIST",
"%Testing%" -> "TESTING",
"%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")
val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)
val df = Seq(
"Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
"Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
"item (1) is blacklisted, #!@"
).toDF("raw_type")
val converter = (value: String) => {
val res = javaPatternMap.map(v => {
v._1.findFirstIn(value) match {
case Some(_) => v._2
case None => ""
}
})
.filter(_.nonEmpty).mkString(", ")
if (res.isEmpty) "Unknown" else res
}
val converterUDF = udf(converter)
val result = df.withColumn("updatedType", converterUDF($"raw_type"))
result.show(false)
输出,
+---------------------------------------------------------+-------------------------------+
|raw_type |updatedType |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low |TOO_LOW_PURCHASE_COUNT |
|#!@ |Unknown |
|item (mejwnw) is blacklisted |BLACK_LIST |
|item (1) is blacklisted, #!@ |BLACK_LIST |
+---------------------------------------------------------+-------------------------------+
希望这可以帮助!
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.