Replacing strings with numbers using zipWithIndex / zipWithUniqueId

Question

I am trying to replace certain strings with numbers using zipWithIndex or zipWithUniqueId.

Let's say I have data in this format:

("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))

I want this result:

(0,(3,4))
(1,(5,6))
(2,(3,8))

What I did was extract the first element of each key-value pair:

val first = file.map(line => line._1).distinct()

Then I applied zipWithIndex:

val z1 = first.zipWithIndex()

I got a result like this:

("u1",0)
("u2",1)
("u3",2)

Now I need to take these IDs and substitute them into my original data. I also need to keep all the distinct string-to-ID pairs in a hash table so that I can look them up later. Is there any way to do that? Any suggestions?

I hope my question is clear.

Answer 1

With:

val rdd = spark.sparkContext.parallelize(Seq(
  ("name", "John"), ("age", "twinty"), ("name", "sam")
))

Flatten the data:

val flat = rdd.flatMap { case (x, y) => Seq(x, y) }

Get the unique values:

val unique = flat.distinct

Index and collect as a map:

val map = unique.zipWithIndex.collectAsMap

Go back and map:

val indexed = rdd.map { case (x, y) => (map(x), map(y)) }

Enjoy the result:

indexed.toLocalIterator.foreach(println)
(2,4)
(3,0)
(2,1)
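
The numbers above are not in input order because distinct shuffles the values before they are indexed. The question also mentions zipWithUniqueId, so here is a rough sketch of how the two numbering schemes differ. It uses plain Scala collections to stand in for RDD partitions, so it runs without Spark. The partition layout and the names UniqueIdSketch, byIndex and byUniqueId are made up for illustration. As far as I know, zipWithUniqueId gives the i-th item of partition k (out of n partitions) the id i * n + k, which is what the sketch imitates:

```scala
// Plain-collections imitation of the two id schemes; no Spark required.
object UniqueIdSketch {
  // Hypothetical split of the distinct values into n = 2 partitions.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("name", "John", "age", "twinty"),
    Seq("sam")
  )
  val n: Int = partitions.length

  // zipWithIndex-style ids: consecutive numbers across all partitions.
  val byIndex: Seq[(String, Long)] =
    partitions.flatten.zipWithIndex.map { case (s, i) => (s, i.toLong) }

  // zipWithUniqueId-style ids: i-th item of partition k gets i * n + k.
  val byUniqueId: Seq[(String, Long)] =
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (s, i) => (s, i.toLong * n + k) }
    }

  def main(args: Array[String]): Unit = {
    byIndex.foreach(println)
    byUniqueId.foreach(println)
  }
}
```

Both schemes give every string a distinct number, so either works for the lookup map. The unique-id variant can leave gaps when partitions are uneven, but on a real RDD it does not need the extra Spark job that zipWithIndex runs when there is more than one partition.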

Edit:

With the rewritten question, replace the first step with:

val flat = rdd.flatMap { case (x, (y, z)) => Seq(x, y, z) }

and the last step with:

val indexed = rdd.map { case (x, (y, z)) => (map(x), (map(y), map(z))) }
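
For the "look them up later" part of the question, the collected map can simply be inverted. Below is a minimal end-to-end sketch for the nested format. It assumes a plain Seq in place of the RDD so that it runs without Spark, and the names IndexSketch, toId and toName are only illustrative. On a real RDD the ids would be Long rather than Int, and the numbering would depend on partitioning:

```scala
// Plain-collections sketch: a Seq stands in for the RDD.
object IndexSketch {
  // Sample records in the question's nested format.
  val data: Seq[(String, (String, String))] = Seq(
    ("u1", ("name", "John Sam")),
    ("u2", ("age", "twinty Four")),
    ("u3", ("name", "sam Blake"))
  )

  // Flatten every string, keep the first occurrence of each, number them.
  val toId: Map[String, Int] =
    data.flatMap { case (x, (y, z)) => Seq(x, y, z) }.distinct.zipWithIndex.toMap

  // Reverse table for recovering the original string from an id.
  val toName: Map[Int, String] = toId.map(_.swap)

  // Substitute ids back into the original records.
  val indexed: Seq[(Int, (Int, Int))] =
    data.map { case (x, (y, z)) => (toId(x), (toId(y), toId(z))) }

  def main(args: Array[String]): Unit = {
    indexed.foreach(println)
    // Look an id back up, e.g. the key of the first record.
    println(toName(indexed.head._1))
  }
}
```

Note that "name" gets the same id in the first and third records, which is the behaviour the desired output in the question asks for. If the map is too large to collect to the driver, a join against the indexed RDD would be the alternative, at the cost of extra shuffles.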

Answer 1
2 2018-01-25 17:15:39
