Spark regex while joining data frames
I need to write a regex for a condition check in Spark while doing a join.
My regex should match the strings below:
n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2
df1.select("location1").distinct.show()
+----------------+
| location1 |
+----------------+
|n3_testindia1 |
|n2_stagamerica2 |
|n1_prodeurope2 |
+----------------+
df2.select("loc1").distinct.show()
+--------------+
| loc1 |
+--------------+
|test-india-1 |
|stag-america-2|
|prod-europe-2 |
+--------------+
I want to join on the location columns, something like this (pseudocode):
val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))
Based on the information above, in Spark 2.4.0 you can do that using
val joindf = df1.join(df2,
regexp_extract(df1("location1"), """[^_]+_(.*)""", 1)
=== translate(df2("loc1"), "-", ""))
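To see why the two sides of that join condition line up, here is a plain-Scala sketch of what the expressions compute on the sample values, without a Spark session. The helper names `extractKey` and `normalize` are hypothetical, introduced only for illustration; `regexp_extract(col, pat, 1)` returns the first capture group (or "" on no match), and `translate(col, "-", "")` deletes every "-".

```scala
// Hypothetical helpers mirroring the Spark expressions on plain strings.
// Note: Spark's regexp_extract finds the first match anywhere in the string;
// Scala's Regex pattern match is anchored to the whole string, which is
// equivalent for these inputs.
val KeyPattern = """[^_]+_(.*)""".r

def extractKey(location1: String): String = location1 match {
  case KeyPattern(rest) => rest // drop the "n3_"-style prefix
  case _                => ""   // regexp_extract returns "" on no match
}

def normalize(loc1: String): String = loc1.replace("-", "")

println(extractKey("n3_testindia1"))                         // testindia1
println(normalize("test-india-1"))                           // testindia1
println(extractKey("n3_testindia1") == normalize("test-india-1")) // true
```

Both sides reduce "n3_testindia1" and "test-india-1" to the same key, "testindia1", so the equality join matches the intended rows.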
Or in prior versions, something like
val joindf = df1.join(df2,
df1("location1").substr(lit(4), length(df1("location1")))
=== translate(df2("loc1"), "-", ""))
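The pre-2.4 variant relies on the prefix always being exactly three characters ("n3_", "n2_", ...), since `Column.substr` is 1-based and `substr(lit(4), length(col))` keeps everything from the fourth character on. A plain-Scala sketch of that assumption (the helper names are made up for illustration):

```scala
// Plain-string equivalent of substr(lit(4), length(col)):
// drop the first 3 characters, i.e. the "n3_" prefix.
def dropPrefix(location1: String): String = location1.drop(3)
def normalize(loc1: String): String = loc1.replace("-", "")

println(dropPrefix("n2_stagamerica2")) // stagamerica2
println(normalize("stag-america-2"))   // stagamerica2
```

If the prefix length can vary (e.g. "n12_"), this fixed-offset version breaks and the regex or split-based approaches are safer.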
You can split location1 on "_" and take the second element (index 1), then match it against loc1 with the "-" characters removed. Check this out:
scala> val df1 = Seq(("n3_testindia1"),("n2_stagamerica2"),("n1_prodeurope2")).toDF("location1")
df1: org.apache.spark.sql.DataFrame = [location1: string]
scala> val df2 = Seq(("test-india-1"),("stag-america-2"),("prod-europe-2")).toDF("loc1")
df2: org.apache.spark.sql.DataFrame = [loc1: string]
scala> df1.join(df2,split('location1,"_")(1) === regexp_replace('loc1,"-",""),"inner").show
+---------------+--------------+
| location1| loc1|
+---------------+--------------+
| n3_testindia1| test-india-1|
|n2_stagamerica2|stag-america-2|
| n1_prodeurope2| prod-europe-2|
+---------------+--------------+