I have the following function which makes an initial DataFrame, iterates over a dict of functions and concatenates a dataframe to the initial one with each iteration:
import pandas as pd

def get_variables_daily(start_date='1990-06-08', end_date='2015-05-04', describe=False):
    # name -> [use flag (1 = include, 0 = skip), loader function]
    variables_daily = {"DP": [1, get_DP_daily], "PE": [1, get_PE_daily], "BM": [1, get_BM_daily], "CAPE": [1, get_CAPE_daily],
                       "PCAprice": [1, get_PCAprice_daily], "BY": [1, get_BY_daily], "DEF": [1, get_DEF_daily],
                       "TERM": [1, get_TERM_daily], "CAY": [1, get_CAY_daily], "SIM": [1, get_SIM_daily], "VRP": [1, get_VRP_daily],
                       "IC": [0, get_IC_daily], "BDI": [1, get_BDI_daily], "NOS": [1, get_NOS_daily], "CPI": [1, get_CPI_daily],
                       "PCR": [1, get_PCR_daily], "MA": [1, get_MA_daily], "PCAtech": [0, get_PCAtech_daily],
                       "OIL": [1, get_OIL_daily], "SI": [1, get_SI_daily]}
    start_date = pd.to_datetime(start_date, yearfirst=True)
    end_date = pd.to_datetime(end_date, yearfirst=True)
    #create initial timeseries
    SPXR_1M = get_SPXR_daily(22, '1990-06-08', '2015-05-04')
    SPXR_3M = get_SPXR_daily(65, '1990-06-08', '2015-05-04')
    SPXR_6M = get_SPXR_daily(130, '1990-06-08', '2015-05-04')
    SPXR_12M = get_SPXR_daily(252, '1990-06-08', '2015-05-04')
    df1 = pd.concat([SPXR_1M, SPXR_3M, SPXR_6M, SPXR_12M], axis=1)
    #iterate over variables
    for key in variables_daily.keys():
        #check if variable should be used
        check = variables_daily[key][0]
        if check == 1:
            df2 = variables_daily[key][1](start_date, end_date).convert_objects(convert_numeric=True)
            df1 = pd.concat([df1, df2], axis=1)
    return df1
As you can see, SPXR_1M, SPXR_3M, SPXR_6M, and SPXR_12M are my base for the DataFrame, meaning I should have NO more rows than SPXR_1M should have. However, if you look at the summary of the final DF:
            count        mean         std         min         25%         50%         75%         max
DP         5706.0    0.018063    0.004894    0.008400    0.014900    0.017900    0.020300    0.040000
PE         6497.0   19.750139    4.267477   10.949800   16.581000   18.395300   23.155400   30.720600
BM         6497.0    0.371955    0.088411    0.192378    0.323687    0.369440    0.437982    0.688610
CAPE       6275.0   25.824579    6.981803   11.849780   20.973447   24.878816   27.720910   47.255292
PCAprice   5706.0   -3.125544    3.082865  -17.258065   -4.958795   -2.354091   -0.651261    0.001848
BY         6249.0    0.977558    0.105177    0.566707    0.915942    0.972425    1.042780    1.457659
DEF        6231.0    0.954645    0.413315    0.430000    0.700000    0.870000    1.050000    3.500000
TERM       6485.0    1.865422    1.158198   -0.989000    0.916300    1.994900    2.793000    3.863000
CAY        6275.0    0.000324    0.016228   -0.031944   -0.012895   -0.002148    0.012844    0.031044
SIM        6252.0    0.742821    0.324054    0.007692    0.484615    0.976923    1.000000    1.000000
VRP        6272.0    0.066305    0.038604   -0.141507    0.042648    0.059513    0.082342    0.372942
BDI        6246.0    0.044917    0.324404   -0.900719   -0.133461    0.012865    0.196221    2.320175
NOS        6191.0    0.010533    0.043129   -0.193359   -0.011152    0.010640    0.033015    0.253546
CPI        6275.0    0.023318    0.011918   -0.020422    0.016667    0.024161    0.029222    0.059571
PCR        6275.0   -1.361110    0.363751   -2.260664   -1.609558   -1.412000   -1.140890   -0.333101
MA         6497.0    0.769432    0.421229    0.000000    1.000000    1.000000    1.000000    1.000000
OIL        6252.0    0.012821    0.179430   -1.132002   -0.079546    0.029171    0.119024    0.771293
SI         2226.0    3.689411    0.744130    1.952062    3.182868    3.719568    4.202039    5.760021
SPXR_22D   6253.0    0.007297    0.045672   -0.297937   -0.016283    0.010915    0.033889    0.224057
SPXR_65D   6210.0    0.022014    0.076810   -0.409638   -0.013432    0.028260    0.067303    0.388187
SPXR_130D  6145.0    0.046397    0.113950   -0.474598   -0.003915    0.055771    0.112865    0.541292
SPXR_252D  6023.0    0.091534    0.169579   -0.488228    0.028048    0.110181    0.200745    0.685735
You can see that the observation counts are not consistent, while they should essentially all be 6253, or a little less for variables that have fewer rows. Is my concatenation not handling extra rows in the appended dataframes properly? EDIT: It seems there are many gaps in my initial columns after all the concatenations. Is there any way to make pandas only add rows from dataframe B that dataframe A already has?
I have several thoughts:
- pd.concat takes the union of the dataframe indices (an outer join by default) and produces NA for the missing values; try setting the join='inner' parameter.
- If you can reorganise the data creation by column, you might also try merge instead (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html); you likely need a left merge.
- Do not suppress the NA values with .dropna(): they may reveal which extra rows are inserted.
- Try disabling the second concat call, so that you can tell which of the two is causing the trouble.
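To make the difference between these join modes concrete, here is a minimal sketch on toy data. The frames `base` and `extra` and their values are made up for illustration: `base` stands in for the SPXR columns and `extra` for the result of one of the get_*_daily functions. Only the index handling matters here.

```python
import pandas as pd

# Hypothetical toy data: `base` plays the role of the SPXR columns,
# `extra` the role of one variable returned by a get_*_daily() loader.
base = pd.DataFrame(
    {"SPXR_22D": [0.01, 0.02, 0.03]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)
extra = pd.DataFrame(
    {"PE": [18.0, 18.5, 19.0]},
    index=pd.to_datetime(["2015-01-03", "2015-01-05", "2015-01-06"]),
)

# Default concat is an outer join: the result holds the union of both indexes.
outer = pd.concat([base, extra], axis=1)

# join="inner" keeps only the dates present in both frames.
inner = pd.concat([base, extra], axis=1, join="inner")

# A left join keeps every row of `base` and nothing else;
# dates missing from `extra` become NaN in the new column.
left = base.join(extra, how="left")

# Dates that `extra` would add to the base index under the default outer join.
added = extra.index.difference(base.index)

print(len(outer), len(inner), len(left))
print(list(added))
```

One design point to weigh: inside a loop, join='inner' shrinks the frame to the dates shared by every variable, so a short series such as SI would cut all the other columns down with it. If the goal is to keep exactly the rows of the initial frame, the left join (or, equivalently, reindexing each new frame to the base index before concatenating) is probably the closer match.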