pyspark: rdd based operation only

Question

I am trying to use only RDD-based operations. I have a file similar to this:

0, Alpha,-3.9, 4, 2001-02-01, 5, 20
0, Beta,-3.8, 3, 2002-02-01, 6, 21
1, Gamma,-3.7, 8, 2003-02-01, 7, 22
0, Alpha,-3.5, 5, 2004-02-01, 8, 23
0, Alpha,-3.9, 6, 2005-02-01, 8, 27

First, I load my data into an RDD as follows:

rdd = sc.textFile(myDataset)

Then I am interested in the distinct values of the name field (the second column) across all rows, meaning Alpha, Beta, Gamma. In this case I expect 3 distinct elements. This is what I did:

coll = [] # to collect the distinct elements
list_ = rdd.collect() # to get the list
for i in list_:
    result = myFun(i) # this function I created to process line by line and return a tuple.
    if result[1] not in coll:
        coll.append(result[1])

Is there a faster/better way to do this using only RDD-based operations?

Answer 1

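You can use map with distinct, as shown below. This keeps the de-duplication on the cluster, so only the distinct values are collected to the driver: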

rdd = sc.textFile('path/to/file/input.txt')
rdd.take(10)
#[u'0, Alpha,-3.9, 4, 2001-02-01, 5, 20', u'0, Beta,-3.8, 3, 2002-02-01, 6, 21', u'1, Gamma,-3.7, 8, 2003-02-01, 7, 22', u'0, Alpha,-3.5, 5, 2004-02-01, 8, 23', u'0, Alpha,-3.9, 6, 2005-02-01, 8, 27']

list_ = rdd.map(lambda line: line.split(",")).map(lambda e : e[1]).distinct().collect() 

list_
[u' Alpha', u' Beta', u' Gamma']
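Note that the values above still carry the leading space that follows each comma in the file (u' Alpha' rather than u'Alpha'). If that matters, a minimal sketch along these lines may help. The helper names second_field and distinct_names are made up for illustration, and it assumes every line has at least two comma-separated fields:

```python
def second_field(line):
    """Return the second comma-separated field of a line, stripped of whitespace."""
    # Assumption: the line contains at least one comma.
    return line.split(",")[1].strip()


def distinct_names(rdd):
    """Return the distinct second-field values of an RDD of text lines.

    Only RDD operations are used: map, distinct, collect.
    """
    return rdd.map(second_field).distinct().collect()


# Usage, assuming `sc` is an existing SparkContext as in the question:
# names = distinct_names(sc.textFile('path/to/file/input.txt'))
```

The order of the collected values is not guaranteed by distinct(), so sort the result if a stable order is needed.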
