How to process a nested Key Value Pair in Spark / Scala data import
I'm new to Spark and Scala, so please bear with me. What I have is a text file in this format:
328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]
I have been able to create an RDD with the sc.textFile command, and I can split each record into its parts with the following command:
val department_record = department_rdd.map(record => record.split(";"))
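As you can see, the third element is a nested key/value pair, and so far I have not been able to work with it. What I am looking for is a way to transform the data above into an RDD that looks like this: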
|ID |NAME |STREET |CITY |STATE|
|328|ADMIN HEARNG|939 W El Camino|Chicago|IL |
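Any help is appreciated.
You can split the address field on "," into an Array, strip the enclosing brackets, and then split each element on "#" to extract the address components you need, as shown below: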
val department_rdd = sc.parallelize(Seq(
  "328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]",
  "400;ADMIN HEARNG;[street#800 First Street,city#San Francisco,state#CA]"
))
val department_record = department_rdd.
  map(_.split(";")).
  map{ case Array(id, name, address) =>
    val addressArr = address.split(",").
      map(_.replaceAll("^\\[|\\]$", "").split("#"))
    (id, name, addressArr(0)(1), addressArr(1)(1), addressArr(2)(1))
  }
department_record.collect
// res1: Array[(String, String, String, String, String)] = Array(
// (328,ADMIN HEARNG,939 W El Camino,Chicago,IL),
// (400,ADMIN HEARNG,800 First Street,San Francisco,CA)
// )
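The snippet above picks the address components by position, so it relies on the keys always arriving in the order street, city, state. If that is not guaranteed, one option is to build a key/value Map first and look the components up by name. The following is only a minimal sketch in plain Scala (no Spark needed to try it); the names Department, DepartmentParser and parseDepartment are hypothetical and not part of the original answer, and it assumes values never contain "," or ";".

```scala
// Hypothetical record type; the field names are assumptions for illustration.
case class Department(id: String, name: String, street: String, city: String, state: String)

object DepartmentParser {
  // Parses one line such as
  //   328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]
  // Returns None when the line does not have three ';'-separated parts
  // or when one of the expected keys is missing.
  def parseDepartment(line: String): Option[Department] =
    line.split(";", 3) match {
      case Array(id, name, address) =>
        // Strip the enclosing brackets, then turn "key#value" pairs into a Map.
        val fields: Map[String, String] = address
          .stripPrefix("[").stripSuffix("]")
          .split(",")
          .map(_.split("#", 2))
          .collect { case Array(k, v) => k.trim -> v.trim }
          .toMap
        // Look components up by key, so their order in the input does not matter.
        for {
          street <- fields.get("street")
          city   <- fields.get("city")
          state  <- fields.get("state")
        } yield Department(id, name, street, city, state)
      case _ => None
    }
}
```

On the RDD from the answer this could be applied with something like department_rdd.flatMap(line => DepartmentParser.parseDepartment(line)), which would drop lines that fail to parse instead of throwing a MatchError.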
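If you want to convert it to a DataFrame, simply apply toDF() (in spark-shell the required implicits are already in scope; in a standalone application, import spark.implicits._ first):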
department_record.toDF("id", "name", "street", "city", "state").show
// +---+------------+----------------+-------------+-----+
// | id| name| street| city|state|
// +---+------------+----------------+-------------+-----+
// |328|ADMIN HEARNG| 939 W El Camino| Chicago| IL|
// |400|ADMIN HEARNG|800 First Street|San Francisco| CA|
// +---+------------+----------------+-------------+-----+
A DataFrame solution:
scala> val df = Seq(("328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]"),
| ("400;ADMIN HEARNG;[street#800 First Street,city#San Francisco,state#CA]")).toDF("dept")
df: org.apache.spark.sql.DataFrame = [dept: string]
scala> val df2 =df.withColumn("arr",split('dept,";")).withColumn("address",split(regexp_replace('arr(2),"\\[|\\]",""),"#"))
df2: org.apache.spark.sql.DataFrame = [dept: string, arr: array<string> ... 1 more field]
scala> df2.select('arr(0) as "id",'arr(1) as "name",split('address(1),",")(0) as "street",split('address(2),",")(0) as "city",'address(3) as "state").show
+---+------------+----------------+-------------+-----+
| id| name| street| city|state|
+---+------------+----------------+-------------+-----+
|328|ADMIN HEARNG| 939 W El Camino| Chicago| IL|
|400|ADMIN HEARNG|800 First Street|San Francisco| CA|
+---+------------+----------------+-------------+-----+
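The chain of split calls above could also be written as a single regular expression with one capture group per column. The sketch below is plain Scala so it can be tried without a Spark session; AddressPattern and extract are hypothetical names, and the pattern assumes the keys always appear in the order street, city, state. Since Spark's regexp_extract function also takes a Java-style pattern and a group index, a pattern of this shape could presumably be reused there, but that is untested here.

```scala
object AddressPattern {
  // Groups: 1 = id, 2 = name, 3 = street, 4 = city, 5 = state.
  // Assumes a fixed key order and no ',' or ';' inside the values.
  val Line =
    """^([^;]*);([^;]*);\[street#([^,]*),city#([^,]*),state#([^\]]*)\]$""".r

  // Returns the five columns as a tuple, or None if the line does not match.
  def extract(line: String): Option[(String, String, String, String, String)] =
    line match {
      case Line(id, name, street, city, state) => Some((id, name, street, city, state))
      case _                                   => None
    }
}
```

Compared with positional indexing into split results, a non-matching line yields None here rather than an index error, at the cost of a stricter assumption about the input layout.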