Replace strings with zipWithIndex/zipWithUniqueId
I am trying to replace certain strings with numbers using zipWithIndex or zipWithUniqueId.
Let's say I have data in this format:
("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))
I want this result:
(0,(3,4))
(1,(5,6))
(2,(3,8))
What I did was extract the first element of each key-value pair:
val first = file.map(line=> line._1).distinct()
Then I applied zipWithIndex:
val z1 = first.zipWithIndex()
I got a result like this:
("u1",0)
("u2",1)
("u3",2)
Now I need to take these ids/numbers and substitute them into my original file. I also need to keep all the distinct ids/numbers in a hash table so that I can look them up later.
Is there any way to do that? Any suggestions?
I hope my question is clear.
With
val rdd = spark.sparkContext.parallelize(Seq(
("name", "John"), ("age", "twinty"), ("name", "sam")
))
Flatten the data:
val flat = rdd.flatMap { case (x, y) => Seq(x, y) }
Get the unique values:
val unique = flat.distinct
Index and collect as a map:
val map = unique.zipWithIndex.collectAsMap
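The question also mentions zipWithUniqueId. As the Spark documentation describes it, items in the k-th of n partitions receive the ids k, n+k, 2n+k, and so on, so the ids are unique but not necessarily consecutive. In exchange it does not need the extra Spark job that zipWithIndex runs on an RDD with more than one partition. The following is only a rough simulation of that id scheme with plain Scala collections; the partition layout and the name uniqueIds are made up for illustration:

```scala
// Hypothetical layout: the distinct strings spread over two partitions.
val partitions = Seq(Seq("name", "John", "age", "twinty"), Seq("sam"))
val n = partitions.size

// Item i (0-based) of partition k gets the id i * n + k.
val uniqueIds: Map[String, Long] =
  partitions.zipWithIndex.flatMap { case (part, k) =>
    part.zipWithIndex.map { case (item, i) => item -> (i.toLong * n + k) }
  }.toMap
```

With this layout the ids are 0, 2, 4 and 6 for the first partition and 1 for the second, so there are gaps. If the ids only need to be distinct that is fine; if they must run from 0 to N-1, zipWithIndex is the safer choice.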
Go back and map:
val indexed = rdd.map { case (x, y) => (map(x), map(y)) }
Enjoy the result:
indexed.toLocalIterator.foreach(println)
(2,4)
(3,0)
(2,1)
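To try the four steps without a Spark session, here is a minimal sketch that uses a plain Scala Seq in place of the RDD (that substitution is an assumption for illustration, as is the name ids, which plays the role of map above). Seq offers flatMap, distinct and zipWithIndex with the same shape, and toMap stands in for collectAsMap:

```scala
// A plain Seq stands in for the RDD (assumption for illustration).
val data = Seq(("name", "John"), ("age", "twinty"), ("name", "sam"))

// Step 1: flatten every pair into its individual strings.
val flat = data.flatMap { case (x, y) => Seq(x, y) }

// Step 2: keep each string once; Seq.distinct preserves first-seen order.
val unique = flat.distinct

// Step 3: pair each string with its position and build a lookup table.
val ids: Map[String, Int] = unique.zipWithIndex.toMap

// Step 4: replace every string in the original pairs with its id.
val indexed = data.map { case (x, y) => (ids(x), ids(y)) }

indexed.foreach(println)
```

On a local Seq this prints (0,1), (2,3) and (0,4), because distinct keeps the order of first appearance. On an RDD the order after distinct is not guaranteed, which is why the ids in the output above differ while the structure stays the same.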
Edit:
With the rewritten question, replace the first step with:
val flat = rdd.flatMap { case (x, (y, z)) => Seq(x, y, z) }
and the last step with: 最后一步:
val indexed = rdd.map { case (x, (y, z)) => (map(x), (map(y), map(z))) }
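The question also asks for a hash table to look the ids up later. The collected map already serves that purpose in one direction; a reverse table can be derived from it. Below is a small sketch on the nested data from the question, again using a plain Scala Seq instead of an RDD (an assumption for illustration; the names ids and names are hypothetical):

```scala
// The nested records from the question, held in a plain Seq.
val data = Seq(
  ("u1", ("name", "John Sam")),
  ("u2", ("age", "twinty Four")),
  ("u3", ("name", "sam Blake"))
)

// String -> id table, built as in the steps above.
val ids: Map[String, Int] =
  data.flatMap { case (x, (y, z)) => Seq(x, y, z) }.distinct.zipWithIndex.toMap

// Replace every string with its id, keeping the nesting.
val indexed = data.map { case (x, (y, z)) => (ids(x), (ids(y), ids(z))) }

// Reverse table (id -> string) for later lookups.
val names: Map[Int, String] = ids.map(_.swap)

// ids(key) throws for an unknown string; get returns an Option instead.
val unknown: Option[Int] = ids.get("u4")
```

Note that this numbers the strings in order of first appearance, so the ids will not match the exact numbers listed in the question, but every distinct string still gets exactly one id. With a real RDD the same tables come from collectAsMap, which brings the whole map to the driver, so this only suits a vocabulary that fits in memory.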