I have a start date and an end date for each ID in one DataFrame (df_with_start_end), and I am trying to find out which dates with the same ID in another DataFrame (df_dates) fall between them. The result should be written to a new column.
My idea was to iterate over the unique IDs in df_with_start_end and, for every ID, check whether any date from df_dates lies between the start and end date from df_with_start_end.
My implementation looks like this, but it doesn't work:
for k in df_with_start_end['ID']:
    df_with_start_end[k]['FREE_PERIOD'] = df_with_start_end[k]['START_DATE'] <= df_dates[k]['DATE'] < df_with_start_end[k]['END_DATE']
I get this error:
Traceback (most recent call last):
File "/opt/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 3685509
Here is an example of the DataFrames:
df_with_start_end
ID START_DATE END_DATE FREE_PERIOD
1 2015-02-13 2016-02-13 False
2 2014-08-27 2015-08-27 True
df_dates
ID DATE
1 2014-04-23
1 2015-08-02
1 2015-09-15
2 2014-06-19
2 2017-01-07
I have heard that loops are slow in Python. Is there a way to avoid them in my case?
It looks like you want to iterate over rows, but you are actually indexing columns.
for k in df_with_start_end['ID']:
means that k is an ID value. However, df_with_start_end[k] accesses the column whose name is k. Since your columns are only ID, START_DATE, END_DATE and FREE_PERIOD, no column with that name exists and you get a KeyError.
One solution is to access the column first and then the ID, by switching the order of the lookups:
df_with_start_end['FREE_PERIOD'][k]
A nicer way is to use the loc indexer:
df_with_start_end.loc[k, 'FREE_PERIOD']
Both forms look k up in the row index, so they assume that ID has been set as the index (for example with set_index('ID')).
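As a rough sketch, a corrected loop could look like the following. It uses the sample data from the question and makes two assumptions that are not stated there: ID is set as the index of df_with_start_end, and, judging from the sample tables, FREE_PERIOD should be True when no date falls inside the half-open interval from START_DATE up to (but not including) END_DATE.

```python
import pandas as pd

# Sample data from the question; ID is set as the index (an assumption).
df_with_start_end = pd.DataFrame({
    "ID": [1, 2],
    "START_DATE": pd.to_datetime(["2015-02-13", "2014-08-27"]),
    "END_DATE": pd.to_datetime(["2016-02-13", "2015-08-27"]),
}).set_index("ID")

df_dates = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(["2014-04-23", "2015-08-02", "2015-09-15",
                            "2014-06-19", "2017-01-07"]),
})

# Create the column up front so it has a boolean dtype.
df_with_start_end["FREE_PERIOD"] = False

for k in df_with_start_end.index:
    # All dates in df_dates that belong to this ID.
    dates_k = df_dates.loc[df_dates["ID"] == k, "DATE"]
    start = df_with_start_end.loc[k, "START_DATE"]
    end = df_with_start_end.loc[k, "END_DATE"]
    # True if at least one date lies in [start, end).
    in_period = ((dates_k >= start) & (dates_k < end)).any()
    # Assumed meaning: the period is "free" when no date falls inside it.
    df_with_start_end.loc[k, "FREE_PERIOD"] = not in_period

print(df_with_start_end)
```

Note that the chained comparison from the question (a <= b < c) does not work on pandas Series; the two comparisons have to be combined with &.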
For me, the easiest way was to join the two DataFrames using merge(). After that, comparing the dates is much simpler. I had been trying to avoid the join, but it looks like it is sometimes the better approach.
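The merge-based approach might look roughly like this. It is only a sketch on the sample data from the question, not the exact code that was used: the helper column IN_PERIOD and the variable has_date are made up for illustration, and FREE_PERIOD is again assumed to be True when no date falls inside the half-open interval from START_DATE to END_DATE, as the sample tables suggest.

```python
import pandas as pd

# Sample data from the question.
df_with_start_end = pd.DataFrame({
    "ID": [1, 2],
    "START_DATE": pd.to_datetime(["2015-02-13", "2014-08-27"]),
    "END_DATE": pd.to_datetime(["2016-02-13", "2015-08-27"]),
})

df_dates = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(["2014-04-23", "2015-08-02", "2015-09-15",
                            "2014-06-19", "2017-01-07"]),
})

# One row per (period, date) pair with the same ID.
# how="left" keeps IDs that have no dates at all (DATE becomes NaT).
merged = df_with_start_end.merge(df_dates, on="ID", how="left")

# Vectorized comparison; rows with NaT compare as False.
# IN_PERIOD is a hypothetical helper column.
merged["IN_PERIOD"] = (
    (merged["START_DATE"] <= merged["DATE"])
    & (merged["DATE"] < merged["END_DATE"])
)

# Per ID: is there at least one date inside the period?
has_date = merged.groupby("ID")["IN_PERIOD"].any()

# Assumed meaning: the period is "free" when no date falls inside it.
df_with_start_end["FREE_PERIOD"] = ~df_with_start_end["ID"].map(has_date)

print(df_with_start_end)
```

This avoids the explicit Python loop: the comparison runs on whole columns, and groupby reduces the result back to one value per ID.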