
How can I get a distinct RDD of dicts in PySpark?

I have an RDD of dictionaries, and I'd like to get an RDD of just the distinct elements. However, when I try to call

rdd.distinct()

PySpark gives me the following error

TypeError: unhashable type: 'dict'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/02/19 16:55:56 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'dict'

I do have a key inside of the dict that I could use as the distinct element, but the documentation doesn't give any clues on how to solve this problem.

EDIT: The content is made up of strings, arrays of strings, and a dictionary of numbers

EDIT 2: Example of a dictionary... I'd like dicts with equal "data_fingerprint" values to be considered equal:

{"id":"4eece341","data_fingerprint":"1707db7bddf011ad884d132bf80baf3c"}

Thanks

As @zero323 pointed out in his comment, you have to decide how to compare dictionaries, since they are not hashable. One way would be to sort the keys (as they are not in any particular order), for example in lexicographic order. Then create a string of the form:

def dict_to_string(d):
    # sort the keys so that equal dicts always produce the same string
    return '|'.join('{}|{}'.format(k, d[k]) for k in sorted(d))

If you have nested unhashable objects, you have to do this recursively.
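
A possible recursive variant (just a sketch, assuming the values are strings, lists of strings, or nested dicts of numbers, as described in the question):

def dict_to_string(d):
    # recursively serialize nested dicts and lists into a canonical string
    if isinstance(d, dict):
        return '|'.join('{}|{}'.format(k, dict_to_string(d[k])) for k in sorted(d))
    if isinstance(d, (list, tuple)):
        return '|'.join(dict_to_string(x) for x in d)
    return str(d)

Note that this naive serialization can collide if keys or values themselves contain the separator character; something like json.dumps(d, sort_keys=True) would be a safer canonical form for JSON-serializable content.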

Now you can transform your RDD into key-value pairs, using that string (or some kind of hash of it) as the key:

pairs = dictRDD.map(lambda d: (dict_to_string(d), d))

To get what you want, you just have to reduce by key as follows:

distinctDicts = pairs.reduceByKey(lambda val1, val2: val1).values()
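
Note that reduceByKey(lambda val1, val2: val1) simply keeps one arbitrary representative per key, which is fine here because any two dicts that serialize to the same string are treated as duplicates.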

Since your data provides a unique key, you can simply do something like this:

(rdd
    .keyBy(lambda d: d.get("data_fingerprint"))
    .reduceByKey(lambda x, y: x)
    .values())
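
For illustration, a minimal self-contained run (the extra records and the local SparkContext setup below are made up for the example):

from pyspark import SparkContext

sc = SparkContext("local", "distinct-dicts")

rdd = sc.parallelize([
    {"id": "4eece341", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"},
    {"id": "8a01c9d2", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"},  # duplicate fingerprint
    {"id": "b2f7e6aa", "data_fingerprint": "52ab9e3f0c6d4b2a8f1e9d7c5b3a1f00"},
])

distinct = (rdd
    .keyBy(lambda d: d.get("data_fingerprint"))  # (fingerprint, dict) pairs
    .reduceByKey(lambda x, y: x)                 # keep one dict per fingerprint
    .values())

print(distinct.count())  # 2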

There are at least two problems with Python dictionaries which make them bad candidates for hashing:

  • mutability - which makes any hashing tricky
  • arbitrary order of keys

A while ago there was a PEP proposing frozendicts (PEP 0416), but it was finally rejected.
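
As an aside (an alternative sketch, not from the original answer): when every value in a dict is itself hashable, frozenset(d.items()) gives an order-insensitive, hashable representation that could be used as the key. It does not help with the nested lists and dicts from the question, though, which is why a canonical serialization like the one above is needed.

d1 = {"id": "4eece341", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"}
d2 = {"data_fingerprint": "1707db7bddf011ad884d132bf80baf3c", "id": "4eece341"}

# key order does not matter for the frozenset of items
assert frozenset(d1.items()) == frozenset(d2.items())

# hash(d1) would raise: TypeError: unhashable type: 'dict'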
