
How to convert a DataFrame back to normal RDD in pyspark?

I need to use the

(rdd.)partitionBy(npartitions, custom_partitioner)

method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data?

Note: this is a change (in 1.3.0) from 1.2.0.
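What the question is after can be sketched without Spark at all: `partitionBy` routes each key-value pair to a partition using a user-supplied function from key to integer. A minimal pure-Python sketch of that idea (the names `partition_by` and the sample partitioner are illustrative, not Spark's implementation):

```python
# Pure-Python sketch of how RDD.partitionBy routes key-value pairs
# using a custom partitioner (a function key -> int).
def partition_by(pairs, npartitions, custom_partitioner):
    partitions = [[] for _ in range(npartitions)]
    for key, value in pairs:
        idx = custom_partitioner(key) % npartitions
        partitions[idx].append((key, value))
    return partitions

pairs = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
# Hypothetical partitioner: use the first character's code point.
parts = partition_by(pairs, 2, lambda k: ord(k[0]))
# All pairs sharing a key land in the same partition.
```

This is exactly why the method needs a key-value RDD: the partitioner only sees keys.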

Update, from @dpangmao's answer: the method is .rdd. I was interested to understand (a) whether it is public and (b) what the performance implications are.

Well, (a) is yes, and for (b) you can see here that there are significant performance implications: a new RDD must be created by invoking mapPartitions:

In dataframe.py (note the file name changed as well; it was sql.py):

@property
def rdd(self):
    """
    Return the content of the :class:`DataFrame` as an :class:`RDD`
    of :class:`Row` s.
    """
    if not hasattr(self, '_lazy_rdd'):
        jrdd = self._jdf.javaToPython()
        rdd = RDD(jrdd, self.sql_ctx._sc, BatchedSerializer(PickleSerializer()))
        schema = self.schema

        def applySchema(it):
            cls = _create_cls(schema)
            return itertools.imap(cls, it)

        self._lazy_rdd = rdd.mapPartitions(applySchema)

    return self._lazy_rdd
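The cost noted above comes from `mapPartitions`: every partition's iterator gets re-wrapped so each raw record can be converted into a `Row`. A pure-Python sketch of that pattern (the dict standing in for a `Row` is an assumption for illustration, not Spark's internals):

```python
# Sketch of the mapPartitions pattern: apply a function to each
# partition's iterator, yielding a transformed iterator per partition.
def map_partitions(partitions, f):
    return [list(f(iter(p))) for p in partitions]

def apply_schema(records):
    # Stand-in for wrapping each raw record in a Row class.
    return ({"value": r} for r in records)

converted = map_partitions([[1, 2], [3]], apply_schema)
```

Every record in every partition passes through the conversion, which is the "significant perf implications" being pointed at.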

Use the .rdd method like this:

rdd = df.rdd

@dapangmao's answer works, but it doesn't give the regular Spark RDD; it returns an RDD of Row objects. If you want the regular RDD format, try this:

rdd = df.rdd.map(tuple)

or

rdd = df.rdd.map(list)
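`Row` behaves like a named tuple, so `map(tuple)` and `map(list)` simply unwrap each record into its field values. A pure-Python sketch with `namedtuple` standing in for `pyspark.sql.Row` (an assumption so the example needs no Spark):

```python
from collections import namedtuple

# namedtuple stands in for pyspark.sql.Row, which is tuple-like.
Row = namedtuple("Row", ["name", "age"])
rows = [Row("Alice", 1), Row("Bob", 2)]

as_tuples = [tuple(r) for r in rows]  # what df.rdd.map(tuple) yields
as_lists = [list(r) for r in rows]    # what df.rdd.map(list) yields
```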

The answer given by kennyut/Kistian works very well, but to get exact RDD-like output when the RDD consists of a list of attributes, e.g. [1, 2, 3, 4], we can use the flatMap command as below:

rdd = df.rdd.flatMap(list)

or

rdd = df.rdd.flatMap(lambda x: list(x))
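The difference from `map(list)` is that `flatMap` splices each converted record into one flat sequence instead of keeping one list per row. A pure-Python equivalent using `itertools.chain` (plain tuples stand in for Row objects):

```python
from itertools import chain

rows = [(1, 2), (3, 4)]  # stand-ins for Row objects

# map(list): one list per row.
mapped = [list(r) for r in rows]

# flatMap(list): all field values flattened into one sequence.
flattened = list(chain.from_iterable(list(r) for r in rows))
```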
