
Create a dictionary of DataFrames in PySpark

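I am trying to create a dictionary keyed by year and month. It is a kind of macro that I can call for the required number of years and months. I am having trouble adding dynamic columns to a PySpark DataFrame.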

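from pyspark.sql import Window
from pyspark.sql.functions import col, date_add, lag, when
from pyspark.sql.types import DateType

df = spark.createDataFrame([(1, "foo1", '2016-1-31'), (1, "test", '2016-1-31'), (2, "bar1", '2012-1-3'), (4, "foo2", '2011-1-11')], ("k", "v", "date"))
# Order all rows by date, newest first, so lag() reads the row that comes just before in that order
w = Window.orderBy(col('date').desc())
df = df.withColumn("next_date", lag('date').over(w).cast(DateType()))
df = df.withColumn("next_name", lag('v').over(w))
# Where k differs from the lagged row's k, use date + 605 days and an empty name instead
df = df.withColumn("next_date", when(col("k") != lag(df.k).over(w), date_add(df.date, 605)).otherwise(col('next_date')))
df = df.withColumn("next_name", when(col("k") != lag(df.k).over(w), "").otherwise(col('next_name')))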

from pyspark.sql.functions import lit, to_date

dict_of_YearMonth = {}

for yearmonth in [200901, 200902, 201605]:  # .. etc
    key_name = 'Snapshot_' + str(yearmonth)
    ym = str(yearmonth)
    snapshot = df.withColumn("test", lit(yearmonth))
    snapshot = snapshot.withColumn("test_date", to_date(lit(ym[:4] + '-' + ym[4:6] + '-01')))
    # Now I want to add a condition: keep only the rows where
    # test_date >= date and test_date <= next_date, and store them as Snapshot_<yearmonth>.
    # I am able to do this in pandas but I am facing a challenge in PySpark.
    dict_of_YearMonth[key_name] = snapshot
dict_of_YearMonth

Then I want to concatenate all the DataFrames into a single PySpark DataFrame. I can do this in pandas as shown below, but I need to do it in PySpark:

  snapshots=pd.concat([dict_of_YearMonth['Snapshot_201104'],dict_of_YearMonth['Snapshot_201105']])

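If there is any other way to generate a dictionary of DataFrames by adding columns dynamically, applying the condition, building one DataFrame per year-month, and merging them into a single DataFrame, any help would be appreciated.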

I tried the code below and it works fine:

from functools import reduce
from pyspark.sql import DataFrame

# Append all the DataFrames using union
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

import datetime

# Convert a YYYYMM value to a 'YYYY-MM-01' date string
def is_date(x):
    try:
        x = str(x) + '01'
        return datetime.datetime.strptime(x, '%Y%m%d').strftime("%Y-%m-%d")
    except ValueError:
        return None  # not a valid YYYYMM value
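The conversion above can also be written so that it returns a `datetime.date` and rejects malformed input explicitly. This is only a sketch; the name `first_day_of_month` is hypothetical and not part of the original answer, and it assumes the input is a six-digit YYYYMM value.

```python
from datetime import date, datetime
from typing import Optional


def first_day_of_month(yearmonth) -> Optional[date]:
    """Return the first day of a YYYYMM value, or None if it is not valid."""
    text = str(yearmonth)
    # Require exactly six digits so that inputs such as "20091" are rejected.
    if len(text) != 6 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m").date()
    except ValueError:
        # Raised for impossible months such as 200913.
        return None
```

Returning `None` for bad input mirrors what the original `is_date` does implicitly when it swallows the `ValueError`.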

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType, StringType

default_date = udf(is_date, StringType())

dict_of_YearMonth = {}
for yearmonth in [200901, 200910]:
    key_name = 'Snapshot_' + str(yearmonth)
    # Bind the current yearmonth as a default argument so each UDF keeps its own value
    func = udf(lambda x, ym=yearmonth: str(ym), StringType())
    dict_of_YearMonth[key_name] = df.withColumn("test", func(col('v')))
    dict_of_YearMonth[key_name] = dict_of_YearMonth[key_name].withColumn("test_date", default_date(col('test')).cast(DateType()))
dict_of_YearMonth
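The loop above hard-codes the list of year-month values. Since the question asks for something that can be called for any required number of years and months, the list could be generated instead. A minimal sketch, assuming both bounds are valid YYYYMM integers; `yearmonth_range` is a hypothetical helper, not something from the original post:

```python
def yearmonth_range(start, end):
    """Return every YYYYMM integer from start to end, inclusive."""
    year, month = divmod(start, 100)
    result = []
    while year * 100 + month <= end:
        result.append(year * 100 + month)
        month += 1
        if month > 12:
            # Roll over from December to January of the next year.
            year, month = year + 1, 1
    return result


yearmonths = yearmonth_range(200911, 201002)
# yearmonths is [200911, 200912, 201001, 201002]
```

The result can be used directly in place of the literal list in the `for yearmonth in ...` loop.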

To combine multiple DataFrames, use the following code:

final_df = unionAll(dict_of_YearMonth['Snapshot_200901'],  dict_of_YearMonth['Snapshot_200910'])
