Efficiently looping through pandas dataframe
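I have some real estate data, and I want to efficiently compute the TimeDelta since each property's previous sale date. Performance matters: I have more than 2 million rows, and my current solution is too slow. Below is what I have implemented so far, but it would take several days to run over my dataframe. Is there a faster way to do this?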
import pandas as pd
import numpy as np
import datetime #import datetime
pd.set_option('display.max_columns',5)
## Make some dummy data
data_dict = dict(
ADDRESS=[
'123 Main Street', '123 Apple Street', '123 Orange Street', '123 Pineapple Street', '123 Pear Street',
'123 Main Street', '123 Apple Street', '123 Orange Street', '123 Pineapple Street', '123 Pear Street',
'123 Main Street', '123 Apple Street', '123 Orange Street', '123 Pineapple Street', '123 Pear Street',
],
SALE_DATE=[
'2002-01-01', '2006-01-01', '2009-01-01', '2011-01-01', '2012-01-01',
'2013-01-01', '2012-01-01', '2012-01-01', '2012-01-01', '2014-01-01',
'2016-01-01', '2018-06-01', '2017-01-01', '2017-01-01', '2019-01-01'
]
)
# format as a pandas df
sale_data = pd.DataFrame(data_dict)
sale_data['SALE_DATE'] = pd.to_datetime(sale_data['SALE_DATE'])
# instantiate a df that we will append our results to
master_df = pd.DataFrame()
#loop through each address to get the last sale and expected future sale date
for address in enumerate(sale_data.ADDRESS.drop_duplicates()):
    df_slice = sale_data[sale_data.ADDRESS == address[1]].sort_values(by='SALE_DATE')
    df_slice['days_since_last_sale'] = df_slice['SALE_DATE'] - df_slice['SALE_DATE'].shift(1)
    df_slice['days_since_last_sale'] = [x.days if x.days > 0 else np.nan for x in df_slice['days_since_last_sale']]
    df_slice['years_since_last_sale'] = df_slice['days_since_last_sale'] / 365
    days_average = np.mean(df_slice['days_since_last_sale'])
    df_slice['next_sale'] = datetime.datetime.today() + datetime.timedelta(days=days_average)
    master_df = pd.concat([df_slice, master_df],
                          axis=0)
print(len(master_df))
print('_________________________________________________________________________________')
print(master_df)
Use:
#sorting per 2 columns for grouping ADDRESS together and correct diff
sale_data = sale_data.sort_values(by=['ADDRESS','SALE_DATE'])
#get difference per groups, convert timedeltas to days
sale_data['days_since_last_sale'] = sale_data.groupby('ADDRESS')['SALE_DATE'].diff().dt.days
#divide by scalar
sale_data['years_since_last_sale'] = sale_data['days_since_last_sale'] / 365
#get mean per groups
days = sale_data.groupby('ADDRESS')['days_since_last_sale'].transform('mean')
#add to datetime timedeltas of days
sale_data['next_sale'] = datetime.datetime.today() + pd.to_timedelta(days, unit='d')
print(sale_data)
ADDRESS SALE_DATE days_since_last_sale \
1 123 Apple Street 2006-01-01 NaN
6 123 Apple Street 2012-01-01 2191.0
11 123 Apple Street 2018-06-01 2343.0
0 123 Main Street 2002-01-01 NaN
5 123 Main Street 2013-01-01 4018.0
10 123 Main Street 2016-01-01 1095.0
2 123 Orange Street 2009-01-01 NaN
7 123 Orange Street 2012-01-01 1095.0
12 123 Orange Street 2017-01-01 1827.0
4 123 Pear Street 2012-01-01 NaN
9 123 Pear Street 2014-01-01 731.0
14 123 Pear Street 2019-01-01 1826.0
3 123 Pineapple Street 2011-01-01 NaN
8 123 Pineapple Street 2012-01-01 365.0
13 123 Pineapple Street 2017-01-01 1827.0
years_since_last_sale next_sale
1 NaN 2025-09-04 14:37:24.900489
6 6.002740 2025-09-04 14:37:24.900489
11 6.419178 2025-09-04 14:37:24.900489
0 NaN 2026-06-21 02:37:24.900489
5 11.008219 2026-06-21 02:37:24.900489
10 3.000000 2026-06-21 02:37:24.900489
2 NaN 2023-06-21 14:37:24.900489
7 3.000000 2023-06-21 14:37:24.900489
12 5.005479 2023-06-21 14:37:24.900489
4 NaN 2022-12-21 02:37:24.900489
9 2.002740 2022-12-21 02:37:24.900489
14 5.002740 2022-12-21 02:37:24.900489
3 NaN 2022-06-21 14:37:24.900489
8 1.000000 2022-06-21 14:37:24.900489
13 5.005479 2022-06-21 14:37:24.900489
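One edge case worth checking with this approach is an address that has only a single sale. The sketch below uses made-up addresses and a fixed reference date in place of datetime.today() (both are assumptions for illustration, not part of the question's data) to show that such a group has no interval to average, so its next_sale should come out as NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample: "C St" has only one sale, rows are deliberately out of order.
sales = pd.DataFrame({
    "ADDRESS": ["A St", "B St", "A St", "B St", "A St", "C St"],
    "SALE_DATE": pd.to_datetime([
        "2013-01-01", "2006-01-01", "2002-01-01",
        "2012-01-01", "2016-01-01", "2015-01-01",
    ]),
})

# Same steps as the answer: sort, per-group diff in days, per-group mean.
sales = sales.sort_values(["ADDRESS", "SALE_DATE"])
sales["days_since_last_sale"] = sales.groupby("ADDRESS")["SALE_DATE"].diff().dt.days
mean_days = sales.groupby("ADDRESS")["days_since_last_sale"].transform("mean")

# Fixed reference date (an assumption) so the result is reproducible.
reference = pd.Timestamp("2020-01-01")
sales["next_sale"] = reference + pd.to_timedelta(mean_days, unit="D")

print(sales)
```

If single-sale properties need a forecast anyway, one option is to fill the missing mean with an overall average before adding it to the reference date; whether that is appropriate depends on the data.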
groupby + diff() should work fine and be faster than a loop:
# assign the per-address difference to the 'delta' column shown in the output
sale_data['delta'] = sale_data.groupby('ADDRESS').SALE_DATE.diff()
Output:
ADDRESS SALE_DATE delta
0 123 Main Street 2002-01-01 NaT
1 123 Apple Street 2006-01-01 NaT
2 123 Orange Street 2009-01-01 NaT
3 123 Pineapple Street 2011-01-01 NaT
4 123 Pear Street 2012-01-01 NaT
5 123 Main Street 2013-01-01 4018 days
6 123 Apple Street 2012-01-01 2191 days
7 123 Orange Street 2012-01-01 1095 days
8 123 Pineapple Street 2012-01-01 365 days
9 123 Pear Street 2014-01-01 731 days
10 123 Main Street 2016-01-01 1095 days
11 123 Apple Street 2018-06-01 2343 days
12 123 Orange Street 2017-01-01 1827 days
13 123 Pineapple Street 2017-01-01 1827 days
14 123 Pear Street 2019-01-01 1826 days
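Note that diff() follows row order within each group, so the output above relies on the sample rows already being in chronological order per address. A minimal sketch with a made-up, deliberately unordered address (the data is an assumption for illustration) shows the difference, and one way to sort first while keeping the frame's original row order:

```python
import pandas as pd

# Hypothetical sample: one address whose sales are NOT in date order.
sales = pd.DataFrame({
    "ADDRESS": ["A St", "A St", "A St"],
    "SALE_DATE": pd.to_datetime(["2013-01-01", "2002-01-01", "2016-01-01"]),
})

# Without sorting, the second row is compared with the later 2013 sale,
# which produces a negative delta.
unsorted_delta = sales.groupby("ADDRESS")["SALE_DATE"].diff()

# Sort by date only for the computation; the assignment aligns on the index,
# so the rows of `sales` stay in their original order.
sales["delta"] = sales.sort_values("SALE_DATE").groupby("ADDRESS")["SALE_DATE"].diff()

print(unsorted_delta)
print(sales)
```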
Use groupby with transform and apply diff to get the difference between dates:
sale_data['Days'] = sale_data.groupby(['ADDRESS'], as_index=False)['SALE_DATE'].transform(pd.Series.diff)
ADDRESS SALE_DATE Days
0 123 Main Street 2002-01-01 NaT
1 123 Apple Street 2006-01-01 NaT
2 123 Orange Street 2009-01-01 NaT
3 123 Pineapple Street 2011-01-01 NaT
4 123 Pear Street 2012-01-01 NaT
5 123 Main Street 2013-01-01 4018 days
6 123 Apple Street 2012-01-01 2191 days
7 123 Orange Street 2012-01-01 1095 days
8 123 Pineapple Street 2012-01-01 365 days
9 123 Pear Street 2014-01-01 731 days
10 123 Main Street 2016-01-01 1095 days
11 123 Apple Street 2018-06-01 2343 days
12 123 Orange Street 2017-01-01 1827 days
13 123 Pineapple Street 2017-01-01 1827 days
14 123 Pear Street 2019-01-01 1826 days
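As far as I can tell, transform(pd.Series.diff) should give the same values as calling diff() directly on the grouped column, and the built-in diff() avoids calling a Python function once per group. The sketch below, using made-up addresses (an assumption for illustration), compares the two and converts the timedeltas to plain day counts with .dt.days:

```python
import pandas as pd

# Hypothetical sample with two addresses, already in date order per address.
sales = pd.DataFrame({
    "ADDRESS": ["A St", "B St", "A St", "B St"],
    "SALE_DATE": pd.to_datetime(
        ["2011-01-01", "2012-01-01", "2012-01-01", "2014-01-01"]
    ),
})

grouped = sales.groupby("ADDRESS")["SALE_DATE"]

# Two ways to get the per-address difference between consecutive sales.
via_transform = grouped.transform(pd.Series.diff)
via_diff = grouped.diff()

# Convert timedeltas to numeric day counts (NaN where there is no earlier sale).
days_from_transform = via_transform.dt.days
days_from_diff = via_diff.dt.days

print(days_from_transform)
print(days_from_diff)
```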