pyspark: "too many values" error after repartitioning

I have a DataFrame (that gets converted to an RDD) and would like to repartition it so that each key (the first column) gets its own partition. This is what I did:

# Repartition into one partition per key, routing each row by its (integer) key
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))

However, when I try to map it back to a DataFrame or save it, I get this error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
        process()
      File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",     line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1703, in add_shuffle_key
    for k, v in iterator:
ValueError: too many values to unpack

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

A bit more testing revealed that even this causes the same error:

my_rdd = df.rdd.partitionBy(x)  # x can be 5, 100, etc.

Have any of you encountered this before? If so, how did you solve it?

partitionBy requires a PairwiseRDD, which in Python is equivalent to an RDD of tuples (or lists) of length 2, where the first element is the key and the second is the value.
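For reference, here is a minimal sketch (with made-up data, not taken from the question) of partitionBy applied to a properly shaped pairwise RDD; it assumes an existing SparkContext named sc:

# Minimal sketch with illustrative data: partitionBy works when every
# element is a (key, value) pair. Assumes an existing SparkContext `sc`.
pairs = sc.parallelize([(0, "a"), (1, "b"), (2, "c"), (0, "d")])

# Send each pair to the partition given by its key (modulo the partition count).
partitioned = pairs.partitionBy(3, lambda key: int(key))

# glom() turns each partition into a list so the layout can be inspected.
print(partitioned.glom().collect())
## e.g. [[(0, 'a'), (0, 'd')], [(1, 'b')], [(2, 'c')]]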

partitionFunc takes the key and maps it to a partition number. When you use partitionBy on an RDD[Row], it tries to unpack each row into a key and a value and fails:

from pyspark.sql import Row

row = Row(1, 2, 3)
k, v = row

## Traceback (most recent call last):
##   ...
## ValueError: too many values to unpack (expected 2)

Even if you provide correctly shaped data by doing something like this:

my_rdd = df.rdd.map(lambda row: (int(row[0]), row)).partitionBy(len(keys))

it wouldn't really make sense. Partitioning is not particularly meaningful in the case of DataFrames. See my answer to How to define partitioning of DataFrame? for more details.
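For completeness, here is a minimal sketch of the DataFrame-level alternative, assuming Spark 1.6 or later (where repartition accepts column expressions; the 1.5.1 release used in the question only accepts a partition count). df and keys refer to the objects from the question:

# Minimal sketch, assuming Spark 1.6+: repartition the DataFrame by its
# first column instead of dropping to the RDD API. Rows with the same key
# are hash-partitioned into the same partition, although a single
# partition may still hold several keys.
repartitioned = df.repartition(len(keys), df[df.columns[0]])

# Inspect how many rows landed in each partition.
print(repartitioned.rdd.glom().map(len).collect())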
