pyspark RDD to DataFrame
I am new to Spark.
I have a DataFrame, and I used the following command to group it by 'userid':
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
This gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
where 326033430 is the group key.
My question is: how can I convert this RDD back to a DataFrame? If I cannot do that, how can I get the values out of the Row objects?
Thank you.
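(As background for the second question: pyspark's Row behaves much like a namedtuple, so values can be read by attribute, by position, or as a dict via row.asDict(). A minimal stdlib sketch, using collections.namedtuple as a stand-in for pyspark.sql.Row:)

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; a real Row supports the same
# attribute and positional access (plus row.asDict()).
Row = namedtuple("Row", ["userid", "poiid", "outmoney"])

row = Row(userid=326033430, poiid="114233866", outmoney=1127.0)

print(row.userid)     # attribute access  -> 326033430
print(row[2])         # positional access -> 1127.0
print(row._asdict())  # dict view; with a pyspark Row use row.asDict()
```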
You should just do:
from pyspark.sql.functions import *

high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('col'))
and in the .agg method, pass whatever you want to do with the rest of the data.
Follow this link: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
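To illustrate what the filter-then-collect_list aggregation produces, here is the same logic expressed with stdlib Python over hypothetical (userid, outmoney) tuples standing in for the DataFrame; flattening the grouped lists back into rows is the plain-Python analogue of rebuilding a DataFrame from the grouped data (in Spark you would use rdd.toDF() or spark.createDataFrame):

```python
from collections import defaultdict

# Hypothetical toy records standing in for the DataFrame rows
records = [
    (326033430, 1127.0),
    (326033430, 2400.0),
    (999999999, 1500.0),
    (326033430, 300.0),   # filtered out: below 1000
]

# Analogue of .filter(outmoney >= 1000).groupBy('userid').agg(collect_list(...))
grouped = defaultdict(list)
for userid, outmoney in records:
    if outmoney >= 1000:
        grouped[userid].append(outmoney)

print(dict(grouped))
# {326033430: [1127.0, 2400.0], 999999999: [1500.0]}

# Flattening back to (userid, outmoney) rows -- the analogue of turning
# the grouped result back into a row-per-record DataFrame
flat = [(uid, m) for uid, values in grouped.items() for m in values]
```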