Spark RDDs: How to join value in a Map to a row in an RDD
I have a CSV file that I am loading into Spark as an RDD with:
val home_rdd = sc.textFile("hdfs://path/to/home_data.csv")
val home_parsed = home_rdd.map(row => row.split(",").map(_.trim))
val home_header = home_parsed.first
val home_data = home_parsed.filter(_(0) != home_header(0))
home_data is then:
scala> home_data
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at filter at <console>:30
scala> home_data.take(3)
res20: Array[Array[String]] = Array(Array("7129300520", "20141013T000000", 221900, "3", "1", 1180, 5650, "1", 0, 0, 3, 7, 1180, 0, 1955, 0, "98178", 47.5112, -122.257, 1340, 5650), Array("6414100192", "20141209T000000", 538000, "3", "2.25", 2570, 7242, "2", 0, 0, 3, 7, 2170, 400, 1951, 1991, "98125", 47.721, -122.319, 1690, 7639), Array("5631500400", "20150225T000000", 180000, "2", "1", 770, 10000, "1", 0, 0, 3, 6, 770, 0, 1933, 0, "98028", 47.7379, -122.233, 2720, 8062))
I also have a CSV of zipcodes to neighborhoods, loaded as an RDD and then used to create a Map[String,String] with:
val zip_rdd = sc.textFile("hdfs://path/to/zipcodes.csv")
val zip_parsed = zip_rdd.map(row => row.split(",").map(_.trim))
val zip_header = zip_parsed.first
val zip_data = zip_parsed.filter(_(0) != zip_header(0))
val zip_map = zip_data.map(row => (row(0), row(1))).collectAsMap
val zip_ind = home_header.indexOf("zipcode") //to get the zipcode column in home_data
Where:
scala> zip_map.take(3)
res21: scala.collection.Map[String,String] = Map(98151 -> Seattle, 98052 -> Redmond, 98104 -> Seattle)
What I am trying to do next is iterate through home_data and use the zipcode value in each row (at zip_ind = 16) to fetch the neighborhood value from zip_map and append that value to the end of the row:
val zip_processed = home_data.map(row => row :+ zip_map.get(row(zip_ind)))
But each time it fetches from zip_map, something fails, so it only appends None to the end of each row in home_data:
scala> zip_processed.take(3)
res19: Array[Array[java.io.Serializable]] = Array(Array("7129300520", "20141013T000000", 221900, "3", "1", 1180, 5650, "1", 0, 0, 3, 7, 1180, 0, 1955, 0, "98178", 47.5112, -122.257, 1340, 5650, None), Array("6414100192", "20141209T000000", 538000, "3", "2.25", 2570, 7242, "2", 0, 0, 3, 7, 2170, 400, 1951, 1991, "98125", 47.721, -122.319, 1690, 7639, None), Array("5631500400", "20150225T000000", 180000, "2", "1", 770, 10000, "1", 0, 0, 3, 6, 770, 0, 1933, 0, "98028", 47.7379, -122.233, 2720, 8062, None))
I am trying to debug this, but am not sure why it's failing at zip_map.get(row(zip_ind)).
I am fairly green with Scala, so maybe I am making some bad assumptions, but I am trying to figure out how to better understand what is happening in the map function.
Map.get() returns an Option: Some(value) when the key is found, and None when there is no match. You can use getOrElse to append the Map value with a fall-back:
val home_data = sc.parallelize(Array(
Array("7129300520", "20141013T000000", 221900, "3", "1", 1180, 5650, "1", 0, 0, 3, 7, 1180, 0, 1955, 0, "98178", 47.5112, -122.257, 1340, 5650),
Array("6414100192", "20141209T000000", 538000, "3", "2.25", 2570, 7242, "2", 0, 0, 3, 7, 2170, 400, 1951, 1991, "98125", 47.721, -122.319, 1690, 7639),
Array("5631500400", "20150225T000000", 180000, "2", "1", 770, 10000, "1", 0, 0, 3, 6, 770, 0, 1933, 0, "98028", 47.7379, -122.233, 2720, 8062)
))
val zip_ind = 16
val zip_map: Map[String, String] = Map("98178" -> "A", "98028" -> "B")
val zip_processed = home_data.map(row => row :+ zip_map.getOrElse(row(zip_ind).toString, "N/A"))
zip_processed.collect
// res1: Array[Array[Any]] = Array(
// Array(7129300520, 20141013T000000, 221900, 3, 1, 1180, 5650, 1, 0, 0, 3, 7, 1180, 0, 1955, 0, 98178, 47.5112, -122.257, 1340, 5650, A),
// Array(6414100192, 20141209T000000, 538000, 3, 2.25, 2570, 7242, 2, 0, 0, 3, 7, 2170, 400, 1951, 1991, 98125, 47.721, -122.319, 1690, 7639, N/A),
// Array(5631500400, 20150225T000000, 180000, 2, 1, 770, 10000, 1, 0, 0, 3, 6, 770, 0, 1933, 0, 98028, 47.7379, -122.233, 2720, 8062, B)
// )
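To make the difference between get and getOrElse concrete, here is a minimal sketch in plain Scala (no Spark required). It also illustrates one possible reason every lookup in the question misses: the home_data.take(3) output prints the zipcode fields with double quotes even though the rows are Array[String], which suggests the quote characters may be part of the string itself. That is an assumption drawn from the printed output, not something confirmed in the question. The names zipMap, rawField, and stripQuotes are hypothetical and used only for illustration.

```scala
// Plain Scala; can be pasted into the Scala REPL or spark-shell.
val zipMap: Map[String, String] = Map("98178" -> "A", "98028" -> "B")

// get wraps the result in an Option: Some(value) on a hit, None on a miss.
val hit: Option[String] = zipMap.get("98178")    // Some(A)
val miss: Option[String] = zipMap.get("98125")   // None

// getOrElse unwraps the Option and supplies a fall-back on a miss.
val unwrapped: String = zipMap.getOrElse("98125", "N/A")   // N/A

// Assumption: the CSV field still carries literal double quotes after split/trim,
// so the key being looked up is "\"98178\"" rather than "98178".
val rawField: String = "\"98178\""
val quotedLookup: Option[String] = zipMap.get(rawField)    // None, the keys differ

// Hypothetical helper: drop double quotes, then trim, before the lookup.
def stripQuotes(s: String): String = s.replace("\"", "").trim

val cleanedLookup: String = zipMap.getOrElse(stripQuotes(rawField), "N/A")   // A
```

If that assumption holds for the real data, applying the same cleanup to row(zip_ind) inside the map over home_data would let the keys match. If the fall-back still appears on every row, printing row(zip_ind) next to a few keys of zip_map should show how the two differ.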