I have a dataframe with a history of employment contracts. An employee may appear in the records several times, and the target documents come in three types. The goal is to calculate how long a particular employee worked at the company. I found a solution, but the code takes almost 2 hours to run. Is there a faster, more convenient way to do this?
The original table has 200,000+ rows.
Here is a sample of its structure:
import pandas as pd
df = pd.DataFrame({
    'name': ['John Johnson', 'John Johnson', 'John Johnson', 'John Johnson',
             'Tom Thompson', 'Tom Thompson',
             'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens',
             'Tom Thompson', 'Tom Thompson', 'Tom Thompson', 'Tom Thompson'],
    'doc_type': ['opening_document', 'any_other_document', 'any_other_document',
                 'closing_document2', 'opening_document', 'any_other_document',
                 'opening_document', 'any_other_document', 'closing_document1',
                 'opening_document', 'closing_document2',
                 'any_other_document', 'closing_document1', 'any_other_document', 'opening_document'],
    'date': pd.to_datetime(['2017-1-1', '2017-1-2', '2017-1-10', '2017-1-15', '2017-1-16', '2017-1-17',
                            '2018-1-2', '2018-1-10', '2018-1-15', '2018-1-16', '2018-1-30',
                            '2017-2-1', '2017-2-4', '2017-3-10', '2017-5-15'])
})
# sort by date
df = df.sort_values(by='date').reset_index(drop=True)
Output:
+----+---------------+--------------------+---------------------+
| | name | doc_type | date |
|----+---------------+--------------------+---------------------|
| 0 | John Johnson | opening_document | 2017-01-01 00:00:00 |
| 1 | John Johnson | any_other_document | 2017-01-02 00:00:00 |
| 2 | John Johnson | any_other_document | 2017-01-10 00:00:00 |
| 3 | John Johnson | closing_document2 | 2017-01-15 00:00:00 |
| 4 | Tom Thompson | opening_document | 2017-01-16 00:00:00 |
| 5 | Tom Thompson | any_other_document | 2017-01-17 00:00:00 |
| 6 | Tom Thompson | any_other_document | 2017-02-01 00:00:00 |
| 7 | Tom Thompson | closing_document1 | 2017-02-04 00:00:00 |
| 8 | Tom Thompson | any_other_document | 2017-03-10 00:00:00 |
| 9 | Tom Thompson | opening_document | 2017-05-15 00:00:00 |
| 10 | Steve Stevens | opening_document | 2018-01-02 00:00:00 |
| 11 | Steve Stevens | any_other_document | 2018-01-10 00:00:00 |
| 12 | Steve Stevens | closing_document1 | 2018-01-15 00:00:00 |
| 13 | Steve Stevens | opening_document | 2018-01-16 00:00:00 |
| 14 | Steve Stevens | closing_document2 | 2018-01-30 00:00:00 |
+----+---------------+--------------------+---------------------+
I need to calculate the time difference between opening_document and (closing_document1 or closing_document2). All of the documents, not only the target ones, appear as rows.
%%time
import sys
import numpy as np

# 'name' alone is not enough for a correct JOIN, so we build a new unique key.
# The logic relies on the fact that a closing document is always preceded by an
# opening one (you can't lay off someone you haven't hired yet).
df['key'] = np.nan                    # create a new empty column
count_key = 0                         # key counter
df.loc[0, 'key'] = count_key          # assign key 0 to row 0
for i in range(1, len(df)):           # start with row 1
    if df.loc[i, 'doc_type'] != 'opening_document':
        df.loc[i, 'key'] = count_key  # not an opening document: keep the same key
    else:
        count_key += 1                # opening document: advance the key
        df.loc[i, 'key'] = count_key  # and assign it to the current row
    # progress bar, just to see that something is happening
    sys.stdout.write('\r')
    sys.stdout.write("[%-20s] %d%%" % ('=' * round(20 * i / (len(df) - 1)),
                                       100 * i / (len(df) - 1)))
    sys.stdout.flush()
print('\n')
On the original dataframe: Wall time: 1h 29min 53s
This gives an additional key that unambiguously determines how to JOIN:
+----+---------------+--------------------+---------------------+-------+
| | name | doc_type | date | key |
|----+---------------+--------------------+---------------------+-------|
| 0 | John Johnson | opening_document | 2017-01-01 00:00:00 | 0 |
| 1 | John Johnson | any_other_document | 2017-01-02 00:00:00 | 0 |
| 2 | John Johnson | any_other_document | 2017-01-10 00:00:00 | 0 |
| 3 | John Johnson | closing_document2 | 2017-01-15 00:00:00 | 0 |
| 4 | Tom Thompson | opening_document | 2017-01-16 00:00:00 | 1 |
| 5 | Tom Thompson | any_other_document | 2017-01-17 00:00:00 | 1 |
| 6 | Tom Thompson | any_other_document | 2017-02-01 00:00:00 | 1 |
| 7 | Tom Thompson | closing_document1 | 2017-02-04 00:00:00 | 1 |
| 8 | Tom Thompson | any_other_document | 2017-03-10 00:00:00 | 1 |
| 9 | Tom Thompson | opening_document | 2017-05-15 00:00:00 | 2 |
| 10 | Steve Stevens | opening_document | 2018-01-02 00:00:00 | 3 |
| 11 | Steve Stevens | any_other_document | 2018-01-10 00:00:00 | 3 |
| 12 | Steve Stevens | closing_document1 | 2018-01-15 00:00:00 | 3 |
| 13 | Steve Stevens | opening_document | 2018-01-16 00:00:00 | 4 |
| 14 | Steve Stevens | closing_document2 | 2018-01-30 00:00:00 | 4 |
+----+---------------+--------------------+---------------------+-------+
Merge to "convert" rows into columns by name and the new key, then calculate the difference between opening and closing in days:
df_merged = pd.merge(df.loc[df['doc_type']=='opening_document'],
df.loc[df['doc_type'].isin(['closing_document1','closing_document2'])],
on=['name','key'],
how='left')
df_merged['time_diff'] = df_merged['date_y'] - df_merged['date_x']
Final correct output:
name doc_type_x date_x key doc_type_y date_y time_diff
-- ------------- ---------------- ------------------- ----- ----------------- ------------------- ----------------
0 John Johnson opening_document 2017-01-01 00:00:00 0 closing_document2 2017-01-15 00:00:00 14 days 00:00:00
1 Tom Thompson opening_document 2017-01-16 00:00:00 1 closing_document1 2017-02-04 00:00:00 19 days 00:00:00
2 Tom Thompson opening_document 2017-05-15 00:00:00 2 nan NaT NaT
3 Steve Stevens opening_document 2018-01-02 00:00:00 3 closing_document1 2018-01-15 00:00:00 13 days 00:00:00
4 Steve Stevens opening_document 2018-01-16 00:00:00 4 closing_document2 2018-01-30 00:00:00 14 days 00:00:00
The best loop-free solution I found uses the diff() method, but then we can't tell which "block" the subtraction belongs to.
Instead of the loop, do this:
df1 = df.loc[df['doc_type'].isin(['opening_document','closing_document1','closing_document2'])].sort_values(by='date').reset_index(drop=True)
df1['diff'] = df1['date'].diff(-1)*(-1)
df1 = df1[df1['doc_type']=='opening_document'].reset_index(drop=True)
Output:
+----+---------------+------------------+---------------------+-------------------+
| | name | doc_type | date | diff |
|----+---------------+------------------+---------------------+-------------------|
| 0 | John Johnson | opening_document | 2017-01-01 00:00:00 | 14 days 00:00:00 |
| 1 | Tom Thompson | opening_document | 2017-01-16 00:00:00 | 19 days 00:00:00 |
| 2 | Tom Thompson | opening_document | 2017-05-15 00:00:00 | 232 days 00:00:00 |
| 3 | Steve Stevens | opening_document | 2018-01-02 00:00:00 | 13 days 00:00:00 |
| 4 | Steve Stevens | opening_document | 2018-01-16 00:00:00 | 14 days 00:00:00 |
+----+---------------+------------------+---------------------+-------------------+
The value in the row with index 2 is wrong: there was no closing document for that opening.
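One way to patch the diff() approach (a sketch; the next_type mask is my own addition, not part of the original code) is to blank out the diff wherever the row that follows is not a closing document, since such a diff spans into the next employment block:

```python
import pandas as pd

# sample data from the question, already sorted by date
df = pd.DataFrame({
    'name': ['John Johnson'] * 4 + ['Tom Thompson'] * 6 + ['Steve Stevens'] * 5,
    'doc_type': ['opening_document', 'any_other_document', 'any_other_document',
                 'closing_document2', 'opening_document', 'any_other_document',
                 'any_other_document', 'closing_document1', 'any_other_document',
                 'opening_document', 'opening_document', 'any_other_document',
                 'closing_document1', 'opening_document', 'closing_document2'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-10', '2017-01-15',
                            '2017-01-16', '2017-01-17', '2017-02-01', '2017-02-04',
                            '2017-03-10', '2017-05-15', '2018-01-02', '2018-01-10',
                            '2018-01-15', '2018-01-16', '2018-01-30']),
})

# keep only the target document types, as in the question
df1 = df[df['doc_type'].isin(['opening_document',
                              'closing_document1',
                              'closing_document2'])].reset_index(drop=True)
df1['diff'] = df1['date'].diff(-1) * (-1)

# mask rows whose *next* document is not a closing one: that diff
# would cross into another employment block
next_type = df1['doc_type'].shift(-1)
df1.loc[~next_type.isin(['closing_document1', 'closing_document2']), 'diff'] = pd.NaT

df1 = df1[df1['doc_type'] == 'opening_document'].reset_index(drop=True)
```

With this mask, the row for Tom Thompson's second opening gets NaT instead of the spurious 232 days.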
To improve the performance of what you do in the for loop, you can use shift on the column 'name' to find where it changes, or where 'doc_type' is 'opening_document', plus cumsum to increment the value, such as:
df['key'] = ((df.name != df.name.shift())|(df.doc_type == 'opening_document')).cumsum()
Then using merge as you do is probably efficient enough. If you want the key to start at 0, just append - 1 to the end of the code above.
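Putting the vectorized key together with the merge from the question, the whole pipeline becomes a few lines. A sketch on the sample data (the `- 1` makes the key start at 0, matching the question's output):

```python
import pandas as pd

# sample data from the question, already sorted by date
df = pd.DataFrame({
    'name': ['John Johnson'] * 4 + ['Tom Thompson'] * 6 + ['Steve Stevens'] * 5,
    'doc_type': ['opening_document', 'any_other_document', 'any_other_document',
                 'closing_document2', 'opening_document', 'any_other_document',
                 'any_other_document', 'closing_document1', 'any_other_document',
                 'opening_document', 'opening_document', 'any_other_document',
                 'closing_document1', 'opening_document', 'closing_document2'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-10', '2017-01-15',
                            '2017-01-16', '2017-01-17', '2017-02-01', '2017-02-04',
                            '2017-03-10', '2017-05-15', '2018-01-02', '2018-01-10',
                            '2018-01-15', '2018-01-16', '2018-01-30']),
})

# vectorized key: a new block starts on a name change or an opening_document
df['key'] = ((df.name != df.name.shift())
             | (df.doc_type == 'opening_document')).cumsum() - 1

# same merge as in the question
df_merged = pd.merge(df.loc[df['doc_type'] == 'opening_document'],
                     df.loc[df['doc_type'].isin(['closing_document1',
                                                 'closing_document2'])],
                     on=['name', 'key'], how='left')
df_merged['time_diff'] = df_merged['date_y'] - df_merged['date_x']
```

This reproduces the five rows of the question's final output, including the NaT for Tom Thompson's second, never-closed contract, but builds the key in vectorized time instead of a Python loop.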
EDIT: since each time the name changes the value in 'doc_type' is 'opening_document', it is possible to keep only the second condition, such as:
df['key'] = (df.doc_type == 'opening_document').cumsum()