How can I loop over every item in a row of a PySpark RDD and turn each one into a key? Should I use the map function?
So, to start with, I have some input like this:
A:<phone1,phone2>,<location1>,<email1>
B:<phone1>,<location2>,<email1,email2>
I want to use the Pyspark.rdd.map() function to loop over every item in a row and turn them into key-value pairs like this:
phone1: A:<phone1,phone2>,<location1>,<email1>
phone1: B:<phone1>,<location2>,<email1,email2>
phone2: A:<phone1,phone2>,<location1>,<email1>
location1: A:<phone1,phone2>,<location1>,<email1>
location2: B:<phone1>,<location2>,<email1,email2>
email1: A:<phone1,phone2>,<location1>,<email1>
email1: B:<phone1>,<location2>,<email1,email2>
email2: B:<phone1>,<location2>,<email1,email2>
In my earlier attempt, I tried adding a loop to the lambda function inside the map function, but that is not supported. Is there another way?
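A function passed to map() returns exactly one value per input row, so a loop that should emit several records does not fit there. flatMap() instead accepts a function that returns a sequence and flattens it, so one row can produce several pairs. The following is only a minimal sketch, assuming every row has the form name:<a,b>,<c>,...; the names to_pairs and pairs_rdd are hypothetical and not part of the original post.

```python
import re


def to_pairs(line):
    """Return one (key, line) tuple per item found inside the <...> groups."""
    # Everything after the first ':' holds the bracketed groups.
    _, _, rest = line.partition(":")
    pairs = []
    for group in re.findall(r"<([^>]*)>", rest):
        for item in group.split(","):
            # The full original row is kept as the value.
            pairs.append((item.strip(), line))
    return pairs


def pairs_rdd(sc, lines):
    """Hypothetical wrapper; sc is assumed to be an existing SparkContext."""
    # flatMap flattens the list returned for each row into separate records.
    return sc.parallelize(lines).flatMap(to_pairs)
```

An ordinary named function is used here instead of a lambda, which is what makes the nested loop possible.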
scala> val rdd = sc.parallelize(Seq("A:<phone1,phone2>,<location1>,<email1>", "B:<phone1>,<location2>,<email1,email2>"))
scala> rdd.foreach(println)
A:<phone1,phone2>,<location1>,<email1>
B:<phone1>,<location2>,<email1,email2>
scala> case class dataclass(c0:String, c1:String)
scala> val df = rdd.map(x => x.split(":")).map(y => dataclass(y(0), y(1))).toDF
scala> df.show(false)
+---+------------------------------------+
|c0 |c1                                  |
+---+------------------------------------+
|A  |<phone1,phone2>,<location1>,<email1>|
|B  |<phone1>,<location2>,<email1,email2>|
+---+------------------------------------+
scala> val df1 = df.withColumn("tempCol", regexp_replace(regexp_replace(col("c1"), "<", ""), ">", ""))
         .withColumn("tempCol", explode(split(col("tempCol"), ",")))
         .withColumn("out", concat(col("tempCol"), lit(":"), col("c0"), lit(":"), col("c1")))
         .drop("c0", "c1", "tempCol")
scala> df1.show(false)
+------------------------------------------------+
|out                                             |
+------------------------------------------------+
|phone1:A:<phone1,phone2>,<location1>,<email1>   |
|phone2:A:<phone1,phone2>,<location1>,<email1>   |
|location1:A:<phone1,phone2>,<location1>,<email1>|
|email1:A:<phone1,phone2>,<location1>,<email1>   |
|phone1:B:<phone1>,<location2>,<email1,email2>   |
|location2:B:<phone1>,<location2>,<email1,email2>|
|email1:B:<phone1>,<location2>,<email1,email2>   |
|email2:B:<phone1>,<location2>,<email1,email2>   |
+------------------------------------------------+
scala> val rdd2 = df1.rdd.map(_(0))
scala> rdd2.foreach(println)
phone1:A:<phone1,phone2>,<location1>,<email1>
phone2:A:<phone1,phone2>,<location1>,<email1>
location1:A:<phone1,phone2>,<location1>,<email1>
email1:A:<phone1,phone2>,<location1>,<email1>
phone1:B:<phone1>,<location2>,<email1,email2>
location2:B:<phone1>,<location2>,<email1,email2>
email1:B:<phone1>,<location2>,<email1,email2>
email2:B:<phone1>,<location2>,<email1,email2>
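Since the question asks about PySpark, the same three steps (strip the angle brackets, split on commas, prefix each item to the original row) can be approximated as a plain Python function suitable for flatMap(). This is a sketch under the assumption that the part before the first ':' is the row name; key_prefixed_rows is a hypothetical name, not something from the answer above.

```python
def key_prefixed_rows(line):
    """Return one 'item:name:value' string per item in the row."""
    # Split once on ':' so the value part keeps any later colons.
    name, _, value = line.partition(":")
    # Same effect as the two regexp_replace calls: drop '<' and '>'.
    flat = value.replace("<", "").replace(">", "")
    # Same effect as explode(split(...)): one output string per item.
    return [f"{item}:{name}:{value}" for item in flat.split(",")]
```

Given a PySpark RDD of the two input rows (assumed here to be called rdd), rdd.flatMap(key_prefixed_rows) should yield the same eight strings, though the order across partitions is not guaranteed.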