
Replace strings with zipWithIndex/zipWithUniqueId

I am trying to replace certain strings with numbers using zipWithIndex or zipWithUniqueId.

Let's say I have data in this format:

("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))

I want this result:

(0,(3,4))
(1,(5,6))
(2,(3,8))

What I did was extract the first element of each key-value pair:

val first = file.map(line => line._1).distinct()

Then I applied zipWithIndex:

val z1 = first.zipWithIndex()

I got a result like this:

("u1",0)
("u2",1)
("u3",2)

Now I need to take these ids/numbers and substitute them into my original file. I also need to keep all the distinct ids/numbers in a hash table so I can look them up later. Is there any way to do that? Any suggestions?

I hope my question is clear.

With this sample data:

val rdd = spark.sparkContext.parallelize(Seq(
  ("name", "John"), ("age", "twinty"), ("name", "sam")
))

Flatten the data:

val flat = rdd.flatMap { case (x, y) => Seq(x, y) }

Get the unique values:

val unique = flat.distinct

Index them and collect the result as a map:

val map = unique.zipWithIndex.collectAsMap
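Since the question mentions both methods, here is a minimal sketch of how they differ. zipWithIndex assigns consecutive indices 0 to n-1 (and runs a Spark job to do so when the RDD has more than one partition), while zipWithUniqueId assigns ids that are unique but not necessarily consecutive. The object name ZipComparison, the sample strings, and the two-partition local setup are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object ZipComparison {
  // Returns the ids produced by zipWithIndex and by zipWithUniqueId
  // for the same small RDD (sample values are hypothetical).
  def ids(spark: SparkSession): (Seq[Long], Seq[Long]) = {
    val unique = spark.sparkContext.parallelize(Seq("a", "b", "c", "d", "e"), numSlices = 2)

    // Consecutive indices, ordered by partition and then by position.
    val byIndex = unique.zipWithIndex().map(_._2).collect().toSeq

    // Unique ids derived from the partition number; gaps are possible.
    val byUniqueId = unique.zipWithUniqueId().map(_._2).collect().toSeq

    (byIndex, byUniqueId)
  }
}
```

Either method works for building the dictionary; zipWithIndex is the one to use if the numbers must be dense.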

Go back and map the original data:

val indexed = rdd.map { case (x, y) => (map(x), map(y)) }

Enjoy the result:

indexed.toLocalIterator.foreach(println)
(2,4)
(3,0)
(2,1)
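The question also asks for a hash table to look the values up again later. One way to get that, sketched here in plain Scala, is to invert the collected map. The dictionary below is a stand-in that matches the output shown above; the actual index values depend on the order Spark returns from distinct, and the names ReverseLookup and decode are hypothetical.

```scala
object ReverseLookup {
  // Stand-in for unique.zipWithIndex.collectAsMap, consistent with the
  // printed result above; real index values may differ between runs.
  val dict: Map[String, Long] = Map(
    "twinty" -> 0L, "sam" -> 1L, "name" -> 2L, "age" -> 3L, "John" -> 4L
  )

  // Indices are unique, so swapping keys and values loses no entries.
  val reverse: Map[Long, String] = dict.map(_.swap)

  // Turns an indexed pair back into the original strings.
  def decode(pair: (Long, Long)): (String, String) =
    (reverse(pair._1), reverse(pair._2))
}
```

With this dictionary, ReverseLookup.decode((2L, 4L)) gives back ("name", "John").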

Edit:

For the rewritten question, replace the first step with:

val flat = rdd.flatMap { case (x, (y, z)) => Seq(x, y, z) }

and the last step with:

val indexed = rdd.map { case (x, (y, z)) => (map(x), (map(y), map(z))) }
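Put together, the edited steps could look like the sketch below. It assumes a local SparkSession and the sample records from the question. Broadcasting the dictionary is an addition that is not part of the answer above: it ships the map to each executor once, which may help when the dictionary is large, but the map still has to fit in driver memory because collectAsMap gathers it there. The object name IndexStrings and the run helper are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object IndexStrings {
  // Returns the string -> index dictionary and the re-encoded records.
  def run(spark: SparkSession): (Map[String, Long], Seq[(Long, (Long, Long))]) = {
    val sc = spark.sparkContext
    val rdd = sc.parallelize(Seq(
      ("u1", ("name", "John Sam")),
      ("u2", ("age", "twinty Four")),
      ("u3", ("name", "sam Blake"))
    ))

    // Flatten, deduplicate, index, and collect to the driver.
    val dict: Map[String, Long] = rdd
      .flatMap { case (x, (y, z)) => Seq(x, y, z) }
      .distinct()
      .zipWithIndex()
      .collectAsMap()
      .toMap

    // Assumption: broadcast the dictionary instead of capturing it in the closure.
    val dictB = sc.broadcast(dict)

    val indexed = rdd.map { case (x, (y, z)) =>
      val d = dictB.value
      (d(x), (d(y), d(z)))
    }
    (dict, indexed.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("index-strings").getOrCreate()
    val (dict, indexed) = run(spark)
    indexed.foreach(println)              // exact numbers depend on distinct's ordering
    val reverse = dict.map(_.swap)        // index -> string, for later lookups
    println(reverse(indexed.head._1))     // decodes the first key back to "u1"
    spark.stop()
  }
}
```

The numbers will generally not match the ones in the question exactly, but equal strings always get the same number, so both "name" records share one index.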
