I'm new to Python and programming altogether, so forgive me if my question seems silly or unclear. I have done research, but quite frankly I have trouble understanding some of the explanations I've read.
I have a dataframe that consists of large amounts of scheduled appointment data for a hospital that needs to be evaluated and modified so that it can be imported into their new scheduling application. Unfortunately the vendor's import tool is trash and does zero checks, so I have to write something that will check the old data and transform it into data that can be uploaded to the new system. Here is an example of the format:
start appointment department procedure resource
20171020131500 MAM BDXMAMUNI BDIAG2
20171020133000 MAM BDXMAMUNI BDIAG1
20171020141500 MAM BDXMAMUNI BDIAG2
20171020143000 MAM BDXMAMUNI BDIAG1
20171020144500 MAM BDXMAMBIL BDIAG2
20171020150000 MAM BDXMAMBIL BDIAG1
20171020151500 MAM BDXMAMUNI BDIAG2
20171023080000 MAM BDXMAMBIL BDIAG1
20171023081500 MAM BDXMAMBIL BDIAG2
I'm trying to do updates based on criteria. This is what I came up with but I cannot get it to update the field. Here are the criteria in my own words.
If start appointment at Index X has minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and resource = BDIAG1, BDIAG2 or BDIAG3, then start appointment at Index X will be in resource ZBMDX3 at Index X
If start appointment at Index X has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then start appointment at Index X will be in resource ZBMDX2 at Index X
If start appointment at Index X has minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then start appointment at Index X will be in resource ZBMDX1 at Index X
If start appointment at Index X has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then start appointment at Index X will be in resource ZBMDX4 at Index X
When the output file is created, it does not have any updated changes. I did some research on StackOverflow but none of the threads I've read seem to work. Some recommended doing some stuff with locs and ix and df.update.
import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']
def diagnostic(): # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15')
                and (resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'
diagnostic()
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
df2 = diagnostic(df)
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
Now I'm getting an error.
Traceback (most recent call last):
  File "Excel Parse.py", line 55, in <module>
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in: >
Traceback (most recent call last):
  File "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\xlsxwriter\\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.
import pandas as pd
my_file = 'C:\\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')
def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15']
    and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and hour in ['9', '10','11','13','14','15']
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7','8','9','10','12','13','14']
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8','9','10','13','14']
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
When the output file is created, I'm still seeing no updates to the resource fields. I evaluated the first 10 rows manually in case the code was working and the criteria simply weren't being met, but the criteria are met.
start appointment dept procedure resource
20171020131500 MAM BDXMAMUNI BDIAG2 should change to ZBMDX3
20171020133000 MAM BDXMAMUNI BDIAG1 should change to ZBMDX4
20171020141500 MAM BDXMAMUNI BDIAG2 should change to ZBMDX3
20171020143000 MAM BDXMAMUNI BDIAG1 should change to ZBMDX4
20171020144500 MAM BDXMAMBIL BDIAG2 should change to ZBMDX1
import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']
def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and hour in ['9', '10','11','13','14','15'] and resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8','9','10','11','13','14','15'] and resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7','8','9','10','12','13','14'] and resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8','9','10','13','14'] and resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df
df2 = diagnostic(df)
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
Same issue. No updates to output file.
It still isn't showing updates in the output. At this point I'm wondering if I should save the xlsx file as a CSV and not use any libraries, or if I should just create the data frame from scratch by iterating over each column (start appointment, resource) into their own respective lists. What do you think?
import pandas as pd
my_file = 'C:\\Users\cboutsikos\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')
def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14', '15']) \
            and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
print(df2.head())
Okay, a few things.
Your function diagnostic makes changes to the global df, but it doesn't accept a DataFrame, nor does it return anything. So when you call it with df2 = diagnostic(df), you are not feeding df into it and you are not getting the modified DataFrame back, but None. That is why you are getting the error telling you that df2 is a NoneType rather than a pd.DataFrame object, and therefore has no attribute 'to_excel'.
It would be better if your function accepted df as an input, made changes to it, and returned the modified df as an output.
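To see why df2 ended up as None, here is a minimal sketch on a two-row stand-in for the real data. The frame contents and the function names modifies_only and modifies_and_returns are made up for illustration and are not from the question.

```python
import pandas as pd

# Tiny stand-in for the real spreadsheet (values are illustrative).
df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2']})

def modifies_only(frame):
    # Changes the frame in place but has no return statement,
    # so Python implicitly returns None.
    frame.loc[0, 'resource'] = 'ZBMDX2'

def modifies_and_returns(frame):
    # Same kind of change, but the frame is handed back to the caller.
    frame.loc[1, 'resource'] = 'ZBMDX3'
    return frame

result_a = modifies_only(df)
result_b = modifies_and_returns(df)

print(result_a)        # None
print(type(result_b))  # a pandas DataFrame
```

Calling result_a.to_excel(...) here would fail with the same AttributeError as in the question, because result_a is None.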
You only need to make two modifications:
1) include df as an argument in the first line: def diagnostic(df):
2) include return df as your last line.
Something like:
def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        ...
        ...
            df.loc[i, 'resource'] = 'ZBMDX4'  # see explanation below.
    return df
Another issue is that you should probably be using df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or objects coercible into DataFrames, per the docs), whereas you are updating one value at a time.
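A small sketch of the difference, assuming a toy three-row frame (the data and the name patch are illustrative). It shows why the df.update['resource'][i] = ... form from the question cannot work, how a single cell is set with df.loc, and what df.update() is meant for.

```python
import pandas as pd

df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2', 'BDIAG3']})

# df.update is a method, so indexing it like a dictionary raises TypeError.
try:
    df.update['resource'][0] = 'ZBMDX3'
    update_error = None
except TypeError as exc:
    update_error = exc

# Label-based assignment of a single cell.
df.loc[0, 'resource'] = 'ZBMDX3'

# df.update() expects another DataFrame and aligns it on index and columns.
patch = pd.DataFrame({'resource': ['ZBMDX1']}, index=[2])
df.update(patch)

print(update_error)
print(df)
```

For one value at a time, df.loc is the more direct tool. df.update() fits better when a whole frame of replacement values already exists.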
Another issue is that your conditions can be simplified. Rather than writing hour == x1 or hour == x2 or ..., you can place the possible values in a list and check for membership. Something like hour in [x1, x2, ...].
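As a quick illustration of the membership check (the hour value and the list of allowed hours are arbitrary examples):

```python
hour = '13'

# Long form: one comparison per allowed value.
long_form = hour == '08' or hour == '09' or hour == '10' or hour == '13'

# Short form: a single membership test against a list.
allowed_hours = ['08', '09', '10', '13']
short_form = hour in allowed_hours

print(long_form, short_form)  # True True
```

Both expressions give the same result, but the list version is easier to read and to extend.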
Since there is a lot to unpack here, I wrote a skeleton of what I'm talking about:
Solution 1
def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        # Slice the hour and minutes out of the 14-digit start time.
        minutes = str(df.loc[i, 'start appointment'])[10:12]
        hour = str(df.loc[i, 'start appointment'])[8:10]
        # condition_1 ... condition_4 are placeholders for your own logic.
        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df
df2 = diagnostic(df)
and each condition would be your logic, something like condition_1 = (minutes == '15') and (hour in ['09', '10', '11']), etc.
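For reference, here is one way the skeleton could look with the conditions filled in. This is a sketch, not a definitive implementation: it assumes the start times are stored as 14-digit integers as in the sample data, applies the resource check to all four rules, and loops over df.index instead of a fixed range(10). The names diagnostic_sketch and OLD_RESOURCES are made up for this example.

```python
import pandas as pd

# Three rows copied from the sample data in the question.
df = pd.DataFrame({
    'start appointment': [20171020131500, 20171020133000, 20171020144500],
    'resource': ['BDIAG2', 'BDIAG1', 'BDIAG2'],
})

OLD_RESOURCES = ['BDIAG1', 'BDIAG2', 'BDIAG3']

def diagnostic_sketch(df):
    for i in df.index:
        stamp = str(df.loc[i, 'start appointment'])
        # Hours are zero-padded because they come from a fixed-width string.
        hour, minutes = stamp[8:10], stamp[10:12]
        is_old = df.loc[i, 'resource'] in OLD_RESOURCES

        condition_1 = is_old and minutes == '15' and hour in ['08', '09', '10', '11', '13', '14', '15']
        condition_2 = is_old and minutes == '00' and hour in ['08', '09', '10', '11', '13', '14', '15']
        condition_3 = is_old and minutes == '45' and hour in ['07', '08', '09', '10', '12', '13', '14']
        condition_4 = is_old and minutes == '30' and hour in ['08', '09', '10', '13', '14']

        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic_sketch(df)
print(df2['resource'].tolist())  # ['ZBMDX3', 'ZBMDX4', 'ZBMDX1']
```

The three results match the "should change to" values listed in the question for those rows.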
Solution 2
Another way to do it is to make a function that makes changes to each row based on some logic, and then apply this to your DataFrame. Something like the following:
def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Dummy conditions; replace these with your own logic.
    cond1, cond2, cond3, cond4 = True, False, False, False
    if cond1:
        row['resource'] = 'ZBMDX3'
    elif cond2:
        row['resource'] = 'ZBMDX2'
    elif cond3:
        row['resource'] = 'ZBMDX1'
    elif cond4:
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
where obviously you would update your conditional logic where I have put in the dummy conditions cond1 etc.
I prefer solution 2, as it is cleaner and easier to keep track of changes. It is also in general more performant (though I haven't verified in this particular case).
I don't have enough points to comment on your question. So I'll just post a modified version of your code that should work:
import pandas as pd
# Raw string so the backslashes in the Windows path are kept literally.
my_file = r'C:\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')
OLD_RESOURCES = ['BDIAG1', 'BDIAG2', 'BDIAG3']
def update_val(row):
    # Hours are zero-padded because they are sliced out of a fixed-width string.
    minutes = str(row['start appointment'])[14:16]  # [10:12] in sample data
    hour = str(row['start appointment'])[11:13]  # [8:10] in sample data
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['08', '09', '10', '11', '13', '14', '15']) and (resource in OLD_RESOURCES):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in ['09', '10', '11', '13', '14', '15']) and (resource in OLD_RESOURCES):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['07', '08', '09', '10', '12', '13', '14']) and (resource in OLD_RESOURCES):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['08', '09', '10', '13', '14']) and (resource in OLD_RESOURCES):
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
print(df2.head())
I made two changes.
1) Placed the sub-conditions in parentheses and dropped the trailing == True. In your original formulation, resource in [...] == True is a chained comparison, which Python reads as (resource in [...]) and ([...] == True), so those conditions never evaluated to True.
2) Changed the indexing of the start appointment value. Based on your sample data, the original indexing was returning an empty str for the minutes, and therefore never matching any of the options.
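Both points can be checked in a plain Python session. The sketch below uses one resource value and one start time from the sample data. The variable names are only for illustration.

```python
resource = 'BDIAG1'
allowed = ['BDIAG1', 'BDIAG2', 'BDIAG3']

# Chained comparison: evaluated as (resource in allowed) and (allowed == True).
chained = resource in allowed == True
# With parentheses the membership test is evaluated first.
grouped = (resource in allowed) == True
# Simplest form: the membership test is already a boolean.
plain = resource in allowed

print(chained, grouped, plain)  # False True True

# Slicing a 14-character start time from the sample data.
stamp = '20171020131500'
empty_slice = stamp[14:16]  # past the end of the string, so ''
minutes = stamp[10:12]      # '15'
hour = stamp[8:10]          # '13'

print(repr(empty_slice), minutes, hour)  # '' 15 13
```

Slicing past the end of a string does not raise an error in Python, which is why the original code ran without complaint but never matched anything.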
P.S. You can just print the first 5 rows to the console to check if the values updated, rather than writing to disk each time.
OK, I've looked at the sample data now and found the issue. There was trailing whitespace in the resource column, causing the logic to fail. This can be removed simply by using str.strip(). Also, the start appointment field is being parsed as a pandas.tslib.Timestamp object, which simplifies our logic because we can extract the minute and hour tokens as ints. The following should work:
def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row
df2 = df.apply(update_val, axis='columns')
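If the row-wise apply turns out to be slow on the full export, the same rules can also be written with boolean masks over whole columns. This is a different technique from the apply-based function above, sketched here under the assumption that start appointment is a datetime column and resource may carry stray whitespace. The four-row frame and the RULES table are illustrative.

```python
import pandas as pd

# Illustrative frame: datetime start times, resources with stray whitespace.
df = pd.DataFrame({
    'start appointment': pd.to_datetime([
        '2017-10-20 13:15:00', '2017-10-20 13:30:00',
        '2017-10-20 14:45:00', '2017-10-20 12:00:00',
    ]),
    'resource': ['BDIAG2 ', 'BDIAG1', ' BDIAG2', 'BDIAG1'],
})

# (minute, allowed hours, new resource), copied from the conditions above.
RULES = [
    (0,  [8, 9, 10, 11, 13, 14, 15], 'ZBMDX2'),
    (15, [8, 9, 10, 11, 13, 14, 15], 'ZBMDX3'),
    (45, [7, 8, 9, 10, 12, 13, 14],  'ZBMDX1'),
    (30, [8, 9, 10, 13, 14],         'ZBMDX4'),
]

df2 = df.copy()
df2['resource'] = df2['resource'].str.strip()  # drop the trailing whitespace
start = df2['start appointment']
# Computed once, before any row is rewritten.
is_old = df2['resource'].isin(['BDIAG1', 'BDIAG2', 'BDIAG3'])

for minute, hours, new_resource in RULES:
    mask = is_old & (start.dt.minute == minute) & start.dt.hour.isin(hours)
    df2.loc[mask, 'resource'] = new_resource

print(df2['resource'].tolist())  # ['ZBMDX3', 'ZBMDX4', 'ZBMDX1', 'BDIAG1']
```

The last row (12:00) matches no rule, so it keeps its stripped original value, just like the else branch in update_val.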