
python spark distinct command fails

print(googleRecToToken.take(5))

The above command returns:

[('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary', 'builder', 'expand', 'vocabulary', 'contains', 'fun', 'lessons', 'teach', 'entertain', 'll', 'quickly', 'find', 'mastering', 'new', 'terms', 'includes', 'games']), ('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums', 'world', '5', 'cd', 'rom', 'set', 'step', 'behind', 'velvet', 'rope', 'examine', 'treasured', 'collections', 'antiquities', 'art', 'inventions', 'includes', 'following', 'louvre', 'virtual', 'visit', '25', 'rooms', 'full', 'screen', 'interactive', 'video', 'detailed', 'map', 'louvre']), ('http://www.google.com/base/feeds/snippets/18445827127704822533', ['sierrahome', 'hse', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'sierrahome']), ('http://www.google.com/base/feeds/snippets/18274317756231697680', ['adobe', 'cs3', 'production', 'premium', 'academic', 'system', 'requirements', 'multicore', 'intel', 'processor', 'adobe', 'photoshop', 'extended', 'illustrator', 'flash', 'professional', 'effects', 'professional', 'universal', 'binary', 'also', 'work', 'powerpc', 'g4', 'g5', 'processor', 'adobe', 'onlocation', 'windows']), ('http://www.google.com/base/feeds/snippets/18409551702230917208', ['equisys', 'premium', 'support', 'zetafax', '2007', 'technical', 'support', '1', 'year', 'equisys', 'premium', 'support', 'zetafax', '2007', 'upgrade', 'license', '10', 'users', 'technical', 'support', 'phone', 'consulting', '1', 'year', '2', 'h'])]

The following command

vendorRDD.map(lambda x: len(x[1])).reduce(lambda x, y : x + y)

works on that RDD, but the command below fails. Why?

print(googleRecToToken.distinct().countByKey())

The error is as follows:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-58-dae073faec09> in <module>()
     10 print (googleRecToToken.take(5))
     11 print ('\n')
---> 12 print (googleRecToToken.distinct().countByKey())
     13 
     14 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByKey(self)
   1516         [('a', 2), ('b', 1)]
   1517         """
-> 1518         return self.map(lambda x: x[0]).countByValue()
   1519 
   1520     def join(self, other, numPartitions=None):

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByValue(self)
   1135                 m1[k] += v
   1136             return m1
-> 1137         return self.mapPartitions(countPartition).reduce(mergeMaps)
   1138 
   1139     def top(self, num, key=None):

I won't guarantee that this is the only issue with your code, but the fundamental problem is that distinct cannot work on the data you've shown.

It requires hash partitioning on the complete (key, value) pair, and Python lists (the values in your data) are not hashable.
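
You can see the underlying problem in plain Python; distinct shuffles records by their hash, and hashing a list raises a TypeError:

# Tuples are immutable, so Python can hash them:
hash(('spanish', 'vocabulary'))   # returns an int

# Lists are mutable and unhashable, which is what trips up distinct():
hash(['spanish', 'vocabulary'])   # TypeError: unhashable type: 'list'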

You can convert the values to a hashable type first, for example:

vendorRDD.mapValues(tuple).distinct()
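
Here is a minimal end-to-end sketch of the fix (the SparkContext sc and the sample records are assumptions for illustration, not your actual data):

# Sketch assuming an existing SparkContext named sc; the sample data is hypothetical.
pairs = sc.parallelize([
    ('url/1', ['spanish', 'vocabulary']),
    ('url/1', ['spanish', 'vocabulary']),  # duplicate (key, value) pair
    ('url/2', ['topics', 'museums']),
])

deduped = pairs.mapValues(tuple).distinct()  # tuple values are hashable
print(deduped.countByKey())                  # e.g. {'url/1': 1, 'url/2': 1}

If downstream code expects lists again, you can convert the values back after deduplication with mapValues(list).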

See also A list as a key for PySpark's reduceByKey
