pyspark RDD to DataFrame

I am new to Spark.

I have a DataFrame, and I used the following command to group it by 'userid':

def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
                    lambda row: row.userid).mapValues(test_groupby)

It gives an RDD with the following structure:

 (326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])

326033430 is the key of the big group.

My question is: how can I convert this RDD back to a DataFrame structure? If I cannot do that, how can I get values from the Row objects?

Thank you.
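
For the second part of the question (pulling values out of the Row objects), a minimal sketch, assuming spark is an active SparkSession and high_volumn is the RDD produced above:

# Row objects support attribute, key, and asDict() access.
key, rows = high_volumn.first()   # e.g. (326033430, [Row(...), ...])
print(rows[0].userid)             # attribute access -> 326033430
print(rows[0]['outmoney'])        # key access -> 1127.0
print(rows[0].asDict())           # whole Row as a plain dict

# To get a DataFrame back, flatten the (userid, [Row, ...]) pairs into
# individual Rows and let Spark infer the schema from them.
flat_df = spark.createDataFrame(high_volumn.flatMap(lambda kv: kv[1]))
flat_df.show()

This keeps the RDD round trip; the answer below avoids it by staying in the DataFrame API.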

You should just do:

from pyspark.sql.functions import *
high_volumn = self.df\
            .filter(self.df.outmoney >= 1000)\
            .groupBy('userid').agg(collect_list('col'))  # 'col' is a placeholder: put the column you want to collect here

In the .agg method, pass whatever aggregation you want to apply to the rest of the columns.

Follow this link: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
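
With the columns from this question (for example poiid and outmoney), a hedged sketch of the same idea, using collect_list over a struct so several fields are kept per row; the alias name 'records' is only illustrative:

from pyspark.sql.functions import collect_list, struct

# Same filter and grouping as above, but collecting concrete columns
# into one list of structs per userid.
high_volumn = (self.df
               .filter(self.df.outmoney >= 1000)
               .groupBy('userid')
               .agg(collect_list(struct('poiid', 'outmoney')).alias('records')))

high_volumn.show(truncate=False)

Each output row then carries the userid plus a 'records' array, which plays the same role as the list of Row objects from the RDD version.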
