pyspark RDD to DataFrame
I am new to Spark.
I have a DataFrame, and I used the following command to group it by 'userid':
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
This gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
where 326033430 is the group key.
My question is: how can I convert this RDD back to a DataFrame? If I cannot do that, how can I get the values out of the Row objects?
Thank you.
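(As background for the second question: pyspark's Row behaves much like a namedtuple, so values can be read by attribute, by position, or as a dict via row.asDict(). A minimal stdlib sketch, using collections.namedtuple as a stand-in for pyspark.sql.Row:)

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; a real Row supports the same
# attribute and positional access (plus row.asDict()).
Row = namedtuple("Row", ["userid", "poiid", "outmoney"])

row = Row(userid=326033430, poiid="114233866", outmoney=1127.0)

print(row.userid)     # attribute access  -> 326033430
print(row[2])         # positional access -> 1127.0
print(row._asdict())  # dict view; with a pyspark Row use row.asDict()
```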
You should just do:
from pyspark.sql.functions import *

high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('col'))
and in the .agg method, pass whatever you want to do with the rest of the data.
Follow this link: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
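To illustrate what the filter-then-collect_list aggregation produces, here is the same logic expressed with stdlib Python over hypothetical (userid, outmoney) tuples standing in for the DataFrame; flattening the grouped lists back into rows is the plain-Python analogue of rebuilding a DataFrame from the grouped data (in Spark you would use rdd.toDF() or spark.createDataFrame):

```python
from collections import defaultdict

# Hypothetical toy records standing in for the DataFrame rows
records = [
    (326033430, 1127.0),
    (326033430, 2400.0),
    (999999999, 1500.0),
    (326033430, 300.0),   # filtered out: below 1000
]

# Analogue of .filter(outmoney >= 1000).groupBy('userid').agg(collect_list(...))
grouped = defaultdict(list)
for userid, outmoney in records:
    if outmoney >= 1000:
        grouped[userid].append(outmoney)

print(dict(grouped))
# {326033430: [1127.0, 2400.0], 999999999: [1500.0]}

# Flattening back to (userid, outmoney) rows -- the analogue of turning
# the grouped result back into a row-per-record DataFrame
flat = [(uid, m) for uid, values in grouped.items() for m in values]
```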