Apache Spark: how to transform Data Frame column with regex to another Data Frame?
I have a Spark Data Frame 1 with several columns: (user_uuid, url, date_visit)
I want to transform this DF1 into Data Frame 2 of the form: (user_uuid, domain, date_visit)
What I want to use is a regular expression to detect the domain and apply it to DF1:
val regexpr = """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r
Could you please help me compose code to transform Data Frames in Scala? I am completely new to Spark and Scala and the syntax is hard. Thanks!
Spark >= 1.5:
You can use the regexp_extract function:
import org.apache.spark.sql.functions.regexp_extract

val pattern: String = ???
val groupIdx: Int = ???

df.withColumn("domain", regexp_extract($"url", pattern, groupIdx))
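regexp_extract follows Java regex semantics, so with the question's pattern the domain lands in capture group 5 (groups 1–4 cover the optional protocol and www prefix). The behavior can be checked outside Spark with plain java.util.regex; this is a sketch, and the sample URLs are assumptions:

```scala
import java.util.regex.Pattern

object DomainRegex {
  // Same pattern the question uses; Spark's regexp_extract applies Java regex rules.
  val pattern: Pattern =
    Pattern.compile("(?i)^((https?):\\/\\/)?((www|www1)\\.)?([\\w-\\.]+)")

  // Returns capture group 5 (the domain part), or "" when the URL does not match.
  def extractDomain(url: String): String = {
    val m = pattern.matcher(url)
    if (m.find()) m.group(5) else ""
  }
}
```

For example, `DomainRegex.extractDomain("https://www.example.com/page")` yields "example.com", which suggests passing 5 as the group index to regexp_extract for this particular pattern.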
Spark < 1.5.0
Define a UDF:
import org.apache.spark.sql.functions.udf

val pattern: scala.util.matching.Regex = ???

def getFirst(pattern: scala.util.matching.Regex) = udf(
  (url: String) => pattern.findFirstIn(url) match {
    case Some(domain) => domain
    case None => "unknown"
  }
)
Use the defined UDF:
df.select(
  $"user_uuid",
  getFirst(pattern)($"url").alias("domain"),
  $"date_visit"
)
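One thing worth knowing about the UDF's logic: findFirstIn returns the entire matched substring, protocol and www prefix included, not just the domain group. To keep only the domain you would need findFirstMatchIn and the group index instead. A plain-Scala sketch of both behaviors (the sample URL is an assumption):

```scala
import scala.util.matching.Regex

object UdfLogicDemo {
  val pattern: Regex =
    """(?i)^((https?):\/\/)?((www|www1)\.)?([\w-\.]+)""".r

  // Mirrors the UDF body above: returns the whole match, e.g. "https://www.example.com".
  def getFirst(url: String): String =
    pattern.findFirstIn(url).getOrElse("unknown")

  // Variant that keeps only capture group 5, the bare domain.
  def getDomain(url: String): String =
    pattern.findFirstMatchIn(url).map(_.group(5)).getOrElse("unknown")
}
```

Which variant you want depends on whether "domain" in DF2 should include the scheme; the original UDF as written produces the longer form.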
or register a temp table:
df.registerTempTable("df")
sqlContext.sql(s"""
  SELECT user_uuid, regexp_extract(url, '$pattern', $group_idx) AS domain, date_visit
  FROM df""")
Replace pattern with a valid Java regexp and group_idx with the index of the capture group.