Python PySpark: subtract 1 year from a given end date to work with a one-year data range

Question

I want to get 1 year of data. The latest date in the date column is my end date, and the end date minus 1 year is my start date. After that, I can filter the data between the start and end dates.

I did manage to get the end date, but can't find how I can get the start date.

Below is the code I have so far. Subtracting 1 year is the part that needs to be solved. If you know how to do the filtering in PySpark, that is also welcome.

from pyspark.sql.functions import min, max
import datetime
import pyspark.sql.functions as F
from pyspark.sql.functions import date_format, col



#convert string to date type 
df = df.withColumn('risk_date', F.to_date(F.col('chosen_risk_prof_date'), 'dd.MM.yyyy'))

#filter only 1 year of data from the big data set.
#calculate the start date and end date. latest_date = end date.

latest_date = df.select((max("risk_date"))).show()
start_date = latest_date - *1 year*
new_df = df.date > start_date & df.date < end_date

After this, I want to get all the data between the start date and the end date.

Answer 1

You can use relativedelta from the dateutil package, as below:

from datetime import datetime
from dateutil.relativedelta import relativedelta
print(datetime.now() - relativedelta(years=1))
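
To connect this to the DataFrame in the question, here is a minimal sketch, not a definitive implementation. The helper names one_year_window and filter_last_year are hypothetical. The column name risk_date comes from the question, and it is assumed to already be a date column (as produced by F.to_date). Note that .show() only prints and returns None, so the latest date is fetched with .first()[0] instead.

```python
from datetime import date
from dateutil.relativedelta import relativedelta


def one_year_window(end_date):
    """Return (start_date, end_date), where start_date is one year before end_date."""
    # relativedelta clamps to the last valid day of the month, so 29 Feb maps to 28 Feb
    return end_date - relativedelta(years=1), end_date


def filter_last_year(df, date_col="risk_date"):
    """Keep only rows whose date_col lies within the last year of data (both ends inclusive)."""
    # Imported here so one_year_window can be used without a Spark installation
    from pyspark.sql import functions as F

    # first()[0] brings the maximum back to the driver as a Python date object
    latest_date = df.agg(F.max(date_col)).first()[0]
    start_date, end_date = one_year_window(latest_date)
    return df.filter(F.col(date_col).between(start_date, end_date))


print(one_year_window(date(2022, 5, 20)))
```

With the question's data this would be called as new_df = filter_last_year(df). between() includes both the start and the end date. If the start date should be excluded, as in the question's attempt, replace it with (F.col(date_col) > start_date) & (F.col(date_col) <= end_date), keeping the parentheses around each comparison.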

solution1
0 2022-05-20 23:34:06