
python spark distinct command fails

print(googleRecToToken.take(5))

The above command returns:

[('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary', 'builder', 'expand', 'vocabulary', 'contains', 'fun', 'lessons', 'teach', 'entertain', 'll', 'quickly', 'find', 'mastering', 'new', 'terms', 'includes', 'games']), ('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums', 'world', '5', 'cd', 'rom', 'set', 'step', 'behind', 'velvet', 'rope', 'examine', 'treasured', 'collections', 'antiquities', 'art', 'inventions', 'includes', 'following', 'louvre', 'virtual', 'visit', '25', 'rooms', 'full', 'screen', 'interactive', 'video', 'detailed', 'map', 'louvre']), ('http://www.google.com/base/feeds/snippets/18445827127704822533', ['sierrahome', 'hse', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'sierrahome']), ('http://www.google.com/base/feeds/snippets/18274317756231697680', ['adobe', 'cs3', 'production', 'premium', 'academic', 'system', 'requirements', 'multicore', 'intel', 'processor', 'adobe', 'photoshop', 'extended', 'illustrator', 'flash', 'professional', 'effects', 'professional', 'universal', 'binary', 'also', 'work', 'powerpc', 'g4', 'g5', 'processor', 'adobe', 'onlocation', 'windows']), ('http://www.google.com/base/feeds/snippets/18409551702230917208', ['equisys', 'premium', 'support', 'zetafax', '2007', 'technical', 'support', '1', 'year', 'equisys', 'premium', 'support', 'zetafax', '2007', 'upgrade', 'license', '10', 'users', 'technical', 'support', 'phone', 'consulting', '1', 'year', '2', 'h'])]

The following command

vendorRDD.map(lambda x: len(x[1])).reduce(lambda x, y : x + y)

works on that RDD, but the command below fails. Why?

print(googleRecToToken.distinct().countByKey())

The error is as follows:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-58-dae073faec09> in <module>()
     10 print (googleRecToToken.take(5))
     11 print ('\n')
---> 12 print (googleRecToToken.distinct().countByKey())
     13 
     14 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByKey(self)
   1516         [('a', 2), ('b', 1)]
   1517         """
-> 1518         return self.map(lambda x: x[0]).countByValue()
   1519 
   1520     def join(self, other, numPartitions=None):

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByValue(self)
   1135                 m1[k] += v
   1136             return m1
-> 1137         return self.mapPartitions(countPartition).reduce(mergeMaps)
   1138 
   1139     def top(self, num, key=None):

I won't guarantee that this is the only issue with your code, but the fundamental problem is that distinct cannot work on the data you've shown.

It requires hash partitioning on the complete (key, value) pair, and Python lists (the values in your data) are not hashable.
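
You can see the underlying problem in plain Python; distinct shuffles records by their hash, and hashing a list raises a TypeError:

# Tuples are immutable, so Python can hash them:
hash(('spanish', 'vocabulary'))   # returns an int

# Lists are mutable and unhashable, which is what trips up distinct():
hash(['spanish', 'vocabulary'])   # TypeError: unhashable type: 'list'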

You can convert the values to a hashable type first, for example:

vendorRDD.mapValues(tuple).distinct()
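
Here is a minimal end-to-end sketch of the fix (the SparkContext sc and the sample records are assumptions for illustration, not your actual data):

# Sketch assuming an existing SparkContext named sc; the sample data is hypothetical.
pairs = sc.parallelize([
    ('url/1', ['spanish', 'vocabulary']),
    ('url/1', ['spanish', 'vocabulary']),  # duplicate (key, value) pair
    ('url/2', ['topics', 'museums']),
])

deduped = pairs.mapValues(tuple).distinct()  # tuple values are hashable
print(deduped.countByKey())                  # e.g. {'url/1': 1, 'url/2': 1}

If downstream code expects lists again, you can convert the values back after deduplication with mapValues(list).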

See also A list as a key for PySpark's reduceByKey
