简体   繁体   中英

How to generate date values into a dataframe in PySpark?

May I know what's wrong with below code? it does not print anything.

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp,last_day,next_day, date_format, date_add, year, month, dayofmonth, dayofyear, dayofweek, date_trunc, date_sub, to_date, add_months, weekofyear, quarter, col
from pyspark.sql.types import StructType,StructField,StringType, IntegerType
ss = SparkSession.builder.appName('DateDim').master('local[1]').getOrCreate()
df = ss.createDataFrame([],StructType([]))
current_date()
df = df.select(current_date().alias("current_date"),next_day(current_date(), 'sunday').alias("next_day"),dayofweek(current_date()).alias("day_of_week"),dayofmonth(current_date()).alias("day_of_month"),dayofyear(current_date()).alias("day_of_year"),last_day(current_date()).alias("last_day"),year(current_date()).alias("year"),month(current_date()).alias("month"), weekofyear(current_date()).alias("week_of_year"),quarter(current_date()).alias("quarter")).collect()
print(df)
for i in range(1, 1000):
    print(i)
for i in range(1, 1000):
    v_date = date_add(v_date, i)
    df.unionAll(df.select(v_date.alias("current_date"),next_day(v_date,'sunday').alias("next_day"),dayofweek(v_date).alias("day_of_week"),dayofmonth(v_date).alias("day_of_month"),dayofyear(v_date).alias("day_of_year"),last_day(v_date).alias("last_day"),year(v_date).alias("year"),month(v_date).alias("month"), weekofyear(v_date).alias("week_of_year"),quarter(v_date).alias("quarter")))
df.show()

You're getting zero rows because there are no rows in the initial df . Any columns being created will have no values as there are no rows in df .

It seems you're trying to create a dataframe with 1000 dates starting from the current day. There's a simple approach using sequence function.

data_sdf = spark.createDataFrame([(1,)], 'id string')

data_sdf. \
    withColumn('min_dt', func.current_date().cast('date')). \
    withColumn('max_dt', func.date_add('min_dt', 1000).cast('date')). \
    withColumn('all_dates', func.expr('sequence(min_dt, max_dt, interval 1 day)')). \
    withColumn('dates_exp', func.explode('all_dates')). \
    drop('id'). \
    show(10)

# +----------+----------+--------------------+----------+
# |    min_dt|    max_dt|           all_dates| dates_exp|
# +----------+----------+--------------------+----------+
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-27|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-28|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-29|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-30|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-08-31|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-01|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-02|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-03|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-04|
# |2022-08-27|2025-05-23|[2022-08-27, 2022...|2022-09-05|
# +----------+----------+--------------------+----------+
# only showing top 10 rows

select the dates_exp field for further use.

You want to use the range() function which generates rows (using sequence you will generate an array which you then need to explode into rows).

That's how you can use it:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, next_day, dayofweek, dayofmonth, dayofyear, last_day, year, month, \
    weekofyear, quarter, current_date

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 1000)
    .alias("id")
    .select(
        (current_date() + col('id').cast("int")).alias("date")
    )
    .select(
        "date",
        next_day("date", 'sunday').alias("next_sunday"),
        dayofweek("date").alias("day_of_week"),
        dayofmonth("date").alias("day_of_month"),
        dayofyear("date").alias("day_of_year"),
        last_day("date").alias("last_day"),
        year("date").alias("year"),
        month("date").alias("month"),
        weekofyear("date").alias("week_of_year"),
        quarter("date").alias("quarter")
    )
).show()

it returns

+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
|      date|next_sunday|day_of_week|day_of_month|day_of_year|  last_day|year|month|week_of_year|quarter|
+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
|2022-09-22| 2022-09-25|          5|          22|        265|2022-09-30|2022|    9|          38|      3|
|2022-09-23| 2022-09-25|          6|          23|        266|2022-09-30|2022|    9|          38|      3|
|2022-09-24| 2022-09-25|          7|          24|        267|2022-09-30|2022|    9|          38|      3|
|2022-09-25| 2022-10-02|          1|          25|        268|2022-09-30|2022|    9|          38|      3|
|2022-09-26| 2022-10-02|          2|          26|        269|2022-09-30|2022|    9|          39|      3|
|2022-09-27| 2022-10-02|          3|          27|        270|2022-09-30|2022|    9|          39|      3|
|2022-09-28| 2022-10-02|          4|          28|        271|2022-09-30|2022|    9|          39|      3|
|2022-09-29| 2022-10-02|          5|          29|        272|2022-09-30|2022|    9|          39|      3|
|2022-09-30| 2022-10-02|          6|          30|        273|2022-09-30|2022|    9|          39|      3|
|2022-10-01| 2022-10-02|          7|           1|        274|2022-10-31|2022|   10|          39|      4|
|2022-10-02| 2022-10-09|          1|           2|        275|2022-10-31|2022|   10|          39|      4|
|2022-10-03| 2022-10-09|          2|           3|        276|2022-10-31|2022|   10|          40|      4|
|2022-10-04| 2022-10-09|          3|           4|        277|2022-10-31|2022|   10|          40|      4|
|2022-10-05| 2022-10-09|          4|           5|        278|2022-10-31|2022|   10|          40|      4|
|2022-10-06| 2022-10-09|          5|           6|        279|2022-10-31|2022|   10|          40|      4|
|2022-10-07| 2022-10-09|          6|           7|        280|2022-10-31|2022|   10|          40|      4|
|2022-10-08| 2022-10-09|          7|           8|        281|2022-10-31|2022|   10|          40|      4|
|2022-10-09| 2022-10-16|          1|           9|        282|2022-10-31|2022|   10|          40|      4|
|2022-10-10| 2022-10-16|          2|          10|        283|2022-10-31|2022|   10|          41|      4|
|2022-10-11| 2022-10-16|          3|          11|        284|2022-10-31|2022|   10|          41|      4|
+----------+-----------+-----------+------------+-----------+----------+----+-----+------------+-------+
only showing top 20 rows

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM