
Python PySpark: subtract 1 year from a given end date to work with one year of data

What I want to do is get one year of data: compute the latest date in the date column as my end date, then subtract one year from that end date to get the start date. After that, I can filter the data between those two dates.

I did manage to get the end date, but I can't find how to get the start date.

Below is the code I have used so far. The "- 1 year" step is what needs to be solved, and if you know how to filter in PySpark, that is welcome too.

# note: the module is pyspark.sql.functions (plural)
import pyspark.sql.functions as F



#convert string to date type 
df = df.withColumn('risk_date', F.to_date(F.col('chosen_risk_prof_date'), 'dd.MM.yyyy'))

# filter only 1 year of data from the big data set
# calculate the start date and end date; latest date = end date

# .show() only prints and returns None; use .first() to get the value back
end_date = df.agg(F.max("risk_date")).first()[0]
start_date = end_date - *1 year*  # <-- this is the part I can't solve
new_df = df.filter((df.risk_date > start_date) & (df.risk_date < end_date))

Then, after this, get all the data between the start date and the end date.

You can use relativedelta, as below:

from datetime import datetime
from dateutil.relativedelta import relativedelta
print(datetime.now() - relativedelta(years=1))
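A minimal self-contained sketch of why relativedelta is the right tool here (the 2024-02-29 end date is just an illustrative value, not taken from the question — it shows the leap-day case that a plain 365-day offset would get wrong):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# subtracting a calendar year handles leap days correctly,
# unlike timedelta(days=365)
end_date = date(2024, 2, 29)
start_date = end_date - relativedelta(years=1)
print(start_date)  # 2023-02-28 (the day is clamped to the last day of February)
```

In the question's code, `end_date` would come from `df.agg(F.max("risk_date")).first()[0]`, and the resulting `start_date` can then be used in the filter, e.g. `df.filter(F.col("risk_date").between(start_date, end_date))` — Spark accepts Python `date` objects in column comparisons.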
