print (googleRecToToken.take(5))
The above command returns:
[('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary', 'builder', 'expand', 'vocabulary', 'contains', 'fun', 'lessons', 'teach', 'entertain', 'll', 'quickly', 'find', 'mastering', 'new', 'terms', 'includes', 'games']), ('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums', 'world', '5', 'cd', 'rom', 'set', 'step', 'behind', 'velvet', 'rope', 'examine', 'treasured', 'collections', 'antiquities', 'art', 'inventions', 'includes', 'following', 'louvre', 'virtual', 'visit', '25', 'rooms', 'full', 'screen', 'interactive', 'video', 'detailed', 'map', 'louvre']), ('http://www.google.com/base/feeds/snippets/18445827127704822533', ['sierrahome', 'hse', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'sierrahome']), ('http://www.google.com/base/feeds/snippets/18274317756231697680', ['adobe', 'cs3', 'production', 'premium', 'academic', 'system', 'requirements', 'multicore', 'intel', 'processor', 'adobe', 'photoshop', 'extended', 'illustrator', 'flash', 'professional', 'effects', 'professional', 'universal', 'binary', 'also', 'work', 'powerpc', 'g4', 'g5', 'processor', 'adobe', 'onlocation', 'windows']), ('http://www.google.com/base/feeds/snippets/18409551702230917208', ['equisys', 'premium', 'support', 'zetafax', '2007', 'technical', 'support', '1', 'year', 'equisys', 'premium', 'support', 'zetafax', '2007', 'upgrade', 'license', '10', 'users', 'technical', 'support', 'phone', 'consulting', '1', 'year', '2', 'h'])]
The command

vendorRDD.map(lambda x: len(x[1])).reduce(lambda x, y: x + y)

works on that RDD, but the command below fails. Why?
print (RDD.distinct().countByKey())
The error is as follows:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-58-dae073faec09> in <module>()
10 print (googleRecToToken.take(5))
11 print ('\n')
---> 12 print (googleRecToToken.distinct().countByKey())
13
14
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByKey(self)
1516 [('a', 2), ('b', 1)]
1517 """
-> 1518 return self.map(lambda x: x[0]).countByValue()
1519
1520 def join(self, other, numPartitions=None):
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByValue(self)
1135 m1[k] += v
1136 return m1
-> 1137 return self.mapPartitions(countPartition).reduce(mergeMaps)
1138
1139 def top(self, num, key=None):
I can't guarantee that this is the only issue with your code, but the fundamental problem is that distinct cannot work on the data you've shown.
It requires hash partitioning on the complete (key, value) pair, and Python lists
(the values in your data) are not hashable.
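The unhashability of lists is plain Python behavior, independent of Spark; a quick check (using a made-up sample value):

```python
# Lists are mutable, so Python refuses to hash them; tuples of hashable
# elements hash fine. This is exactly why distinct() chokes on list values.
value = ['spanish', 'vocabulary', 'builder']

try:
    hash(value)          # raises TypeError: unhashable type: 'list'
    list_hashable = True
except TypeError:
    list_hashable = False

tuple_hash = hash(tuple(value))  # works: tuples of strings are hashable

print(list_hashable)  # False
```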
You can convert these to some hashable type first, for example:
vendorRDD.mapValues(tuple).distinct()
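For reference, this is what distinct().countByKey() computes once the values are tuples, sketched here in plain Python without Spark; the (key, value) records below are made up for illustration:

```python
from collections import Counter

# Hypothetical records mimicking the RDD's (url, token-list) pairs.
records = [
    ('url1', ['spanish', 'vocabulary']),
    ('url1', ['spanish', 'vocabulary']),  # exact duplicate: dropped by distinct()
    ('url1', ['builder']),
    ('url2', ['museums']),
]

# mapValues(tuple) makes each pair hashable; a set then models distinct().
distinct_pairs = {(k, tuple(v)) for k, v in records}

# countByKey() counts how many distinct pairs remain per key.
counts = Counter(k for k, _ in distinct_pairs)

print(dict(counts))  # {'url1': 2, 'url2': 1}
```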