
How to convert a DataFrame back to normal RDD in pyspark?

I need to use the

(rdd.)partitionBy(npartitions, custom_partitioner)

method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data?

Note: this is a change (in 1.3.0) from 1.2.0.
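What the question is after can be sketched without Spark at all: `partitionBy` routes each key-value pair to a partition using a user-supplied function from key to integer. A minimal pure-Python sketch of that idea (the names `partition_by` and the sample partitioner are illustrative, not Spark's implementation):

```python
# Pure-Python sketch of how RDD.partitionBy routes key-value pairs
# using a custom partitioner (a function key -> int).
def partition_by(pairs, npartitions, custom_partitioner):
    partitions = [[] for _ in range(npartitions)]
    for key, value in pairs:
        idx = custom_partitioner(key) % npartitions
        partitions[idx].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
# Hypothetical partitioner: use the first character's code point.
parts = partition_by(pairs, 2, lambda k: ord(k[0]))
# All pairs sharing a key land in the same partition.
```

This is exactly why the method needs a key-value RDD: the partitioner only sees keys.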

Update, from @dpangmao's answer: the method is .rdd. I was interested to understand (a) whether it is public and (b) what the performance implications are.

Well, (a) is yes, and for (b) you can see here that there are significant performance implications: a new RDD must be created by invoking mapPartitions:

In dataframe.py (note the file name changed as well; it was sql.py):

@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema

        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)

        self._lazy_rdd = rdd.mapPartitions(applySchema)

    return self._lazy_rdd
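The cost noted above comes from `mapPartitions`: every partition's iterator gets re-wrapped so each raw record can be converted into a `Row`. A pure-Python sketch of that pattern (the dict standing in for a `Row` is an assumption for illustration, not Spark's internals):

```python
# Sketch of the mapPartitions pattern: apply a function to each
# partition's iterator, yielding a transformed iterator per partition.
def map_partitions(partitions, f):
    return [list(f(iter(p))) for p in partitions]

def apply_schema(records):
    # Stand-in for wrapping each raw record in a Row class.
    return ({"value": r} for r in records)

converted = map_partitions([[1, 2], [3]], apply_schema)
```

Every record in every partition passes through the conversion, which is the "significant perf implications" being pointed at.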

Use the .rdd method like this:

rdd = df.rdd

@dapangmao's answer works, but it doesn't give the regular Spark RDD; it returns an RDD of Row objects. If you want the regular RDD format, try this:

rdd = df.rdd.map(tuple)

or

rdd = df.rdd.map(list)
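`Row` behaves like a named tuple, so `map(tuple)` and `map(list)` simply unwrap each record into its field values. A pure-Python sketch with `namedtuple` standing in for `pyspark.sql.Row` (an assumption so the example needs no Spark):

```python
from collections import namedtuple

# namedtuple stands in for pyspark.sql.Row, which is tuple-like.
Row = namedtuple("Row", ["name", "age"])
rows = [Row("Alice", 1), Row("Bob", 2)]

as_tuples = [tuple(r) for r in rows]  # what df.rdd.map(tuple) yields
as_lists = [list(r) for r in rows]    # what df.rdd.map(list) yields
```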

The answer given by kennyut/Kistian works very well, but to get exact RDD-like output when the RDD consists of a list of attributes, e.g. [1, 2, 3, 4], we can use the flatMap command as below:

rdd = df.rdd.flatMap(list)

or

rdd = df.rdd.flatMap(lambda x: list(x))
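The difference from `map(list)` is that `flatMap` splices each converted record into one flat sequence instead of keeping one list per row. A pure-Python equivalent using `itertools.chain` (plain tuples stand in for Row objects):

```python
from itertools import chain

rows = [(1, 2), (3, 4)]  # stand-ins for Row objects

# map(list): one list per row.
mapped = [list(r) for r in rows]

# flatMap(list): all field values flattened into one sequence.
flattened = list(chain.from_iterable(list(r) for r in rows))
```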
