
How can I update my DataFrame in Pandas and export out to Excel?

I'm new to Python and to programming altogether, so forgive me if my question seems silly or unclear. I have done some research, but quite frankly I have trouble understanding some of the explanations I've read.

I have a DataFrame that holds a large amount of scheduled appointment data for a hospital. It needs to be evaluated and modified so that it can be imported into their new scheduling application. Unfortunately the vendor's import tool is trash and does zero checks, so I have to write something that will check the old data and transform it into data that can be uploaded to the new system. Here is an example of the format:

start appointment   department  procedure   resource
20171020131500      MAM         BDXMAMUNI   BDIAG2    
20171020133000      MAM         BDXMAMUNI   BDIAG1    
20171020141500      MAM         BDXMAMUNI   BDIAG2    
20171020143000      MAM         BDXMAMUNI   BDIAG1    
20171020144500      MAM         BDXMAMBIL   BDIAG2    
20171020150000      MAM         BDXMAMBIL   BDIAG1    
20171020151500      MAM         BDXMAMUNI   BDIAG2    
20171023080000      MAM         BDXMAMBIL   BDIAG1    
20171023081500      MAM         BDXMAMBIL   BDIAG2       

I'm trying to do updates based on criteria. This is what I came up with but I cannot get it to update the field. Here are the criteria in my own words.

If the start appointment at index X has minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and resource = BDIAG1, BDIAG2 or BDIAG3, then the start appointment at index X will be in resource ZBMDX3 at index X.

If the start appointment at index X has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then the start appointment at index X will be in resource ZBMDX2 at index X.

If the start appointment at index X has minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX1 at index X.

If the start appointment at index X has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX4 at index X.

When the output file is created, it does not have any updated changes. I did some research on StackOverflow but none of the threads I've read seem to work. Some recommended doing some stuff with locs and ix and df.update.

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')

dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']


def diagnostic():  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'

diagnostic()

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

I made the changes recommended.

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

Now I'm getting an error:

Traceback (most recent call last):
  File "Excel Parse.py", line 55, in <module>
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in: >
Traceback (most recent call last):
  File "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\xlsxwriter\\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.

Seiji, I completely updated my code to reflect your changes. Let's look at Solution 2 as that one processed quicker.

import pandas as pd

my_file = 'C:\\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15']
        and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
    row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and  hour in ['9', '10','11','13','14','15']
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7','8','9','10','12','13','14'] 
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
    row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8','9','10','13','14'] 
    and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
return row        

df2 = df.apply(update_val, axis='columns')

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python     3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

When the output file is created, I'm still seeing no updates to the resource fields. I evaluated the first 10 rows manually, in case the code was working and the criteria simply were not being met, but the criteria are met:

start appointment dept      procedure   resource
20171020131500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020133000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020141500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020143000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020144500    MAM       BDXMAMBIL   BDIAG2    should change to ZBMDX1

Solution 1 by Seiji

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']

def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and  hour in ['9', '10','11','13','14','15'] and     resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8','9','10','11','13','14','15']     and resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7','8','9','10','12','13','14']     and resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8','9','10','13','14'] and     resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python     3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

Same issue. No updates to output file.

Modified Slicing of Hour and Minute

Still isn't showing updates in the output. At this point I'm wondering if I should save the xlsx file as a CSV and not use any libraries, or if I should just create the data frame from scratch by reading each column (start appointment, resource) into its own list. What do you think?

import pandas as pd

my_file = 'C:\\Users\cboutsikos\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14',     '15']) \
         and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and  (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

Okay a few things..

Your function diagnostic makes changes to the global df, but it neither accepts a DataFrame nor returns anything. So when you call it with df2 = diagnostic(df), you are not feeding df into it, and instead of the modified DataFrame you get back None. That is why the error tells you that df2 is a 'NoneType' object, not a pd.DataFrame, and therefore has no attribute 'to_excel'.
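To see the effect in isolation, here is a minimal sketch on a made-up two-row frame. The function names and the sample values are hypothetical, chosen only for illustration: a function that changes its argument but has no return statement evaluates to None, while one that ends with return hands the frame back.

```python
import pandas as pd

def update_without_return(frame):
    # Changes the frame in place, but there is no return statement,
    # so the call itself evaluates to None.
    frame.loc[0, "resource"] = "ZBMDX2"

def update_with_return(frame):
    frame.loc[0, "resource"] = "ZBMDX2"
    return frame

sample = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2"]})

result_a = update_without_return(sample.copy())
result_b = update_with_return(sample.copy())

print(result_a)                  # None
print(type(result_b).__name__)   # DataFrame
```

Calling result_a.to_excel(...) here would fail with the same AttributeError as in the question.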

It would be better if your function accepted df as an input, made changes to it, and returned the modified df as an output.

You only need to make two modifications:

1) include df as an argument in the first line: def diagnostic(df):

2) include return df as your last line.

Something like:

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
      ...
      ...
            df.loc[i, 'resource'] = 'ZBMDX4' # see explanation below.
    return df

Another issue is that you should probably be using df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or objects coercible into DataFrames, from the doc), whereas you are updating one value at a time.
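As a rough illustration of the difference, the sketch below sets one cell with df.loc and then applies a one-row patch with df.update. The three-row frame and the patch frame are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2", "BDIAG1"]})

# Single cell: row label 1, column 'resource'.
df.loc[1, "resource"] = "ZBMDX3"

# df.update() takes another DataFrame, aligns it on index and columns,
# and overwrites in place with its non-NA values. It returns None.
patch = pd.DataFrame({"resource": ["ZBMDX4"]}, index=[2])
df.update(patch)

print(df["resource"].tolist())
```

Note that update is a method, so it has to be called with parentheses; subscripting it as df.update['resource'] raises a TypeError if that line is ever reached.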

Another issue is that your conditions can be simplified. Rather than writing hour == x1 or hour == x2 or .... you can place possible values in a list and check for membership. Something like hour in [x1, x2, ...] .
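A small sketch of the membership check, assuming the start appointment prints as YYYYMMDDhhmmss like the sample data in the question. It also shows one thing to watch for when comparing strings: a sliced hour is zero-padded, so '08' does not match '8'.

```python
allowed_hours = ["08", "09", "10", "11", "13", "14", "15"]

stamp = str(20171023080000)   # a start appointment from the sample data
hour = stamp[8:10]            # two-character hour, zero-padded

# The chained comparison and the membership test give the same answer.
long_form = (hour == "08" or hour == "09" or hour == "10" or hour == "11"
             or hour == "13" or hour == "14" or hour == "15")
short_form = hour in allowed_hours

print(hour, long_form, short_form)   # 08 True True
print(hour in ["8", "9", "10"])      # False, because '08' is not '8'
```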

Since there is a lot to unpack here, I wrote a skeleton of what I'm talking about:

Solution 1

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        # Slices assume the value prints as YYYYMMDDhhmmss, as in the sample data
        minutes = str(df.loc[i, 'start appointment'])[10:12]
        hour = str(df.loc[i, 'start appointment'])[8:10]
        # condition_1 ... condition_4 are placeholders for your own logic
        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

and each condition would be your logic (something like condition_1 = (minutes == '15') and (hour in ['09', '10', '11'])), etc.
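For reference, here is one way the loop approach could look with the conditions filled in. It is only a sketch under some assumptions: the start appointment is stored as an integer in YYYYMMDDhhmmss form (as in the sample data), the resource column has no stray whitespace, and every rule is restricted to the three BDIAG resources. The four-row frame is taken from the question's sample rows.

```python
import pandas as pd

df = pd.DataFrame({
    "start appointment": [20171020131500, 20171020133000,
                          20171020144500, 20171023080000],
    "resource": ["BDIAG2", "BDIAG1", "BDIAG2", "BDIAG1"],
})

DIAG_RESOURCES = ["BDIAG1", "BDIAG2", "BDIAG3"]

def diagnostic(df):
    for i in df.index:
        stamp = str(df.loc[i, "start appointment"])
        hour, minutes = stamp[8:10], stamp[10:12]   # zero-padded strings
        if df.loc[i, "resource"] not in DIAG_RESOURCES:
            continue
        if minutes == "15" and hour in ["08", "09", "10", "11", "13", "14", "15"]:
            df.loc[i, "resource"] = "ZBMDX3"
        elif minutes == "00" and hour in ["08", "09", "10", "11", "13", "14", "15"]:
            df.loc[i, "resource"] = "ZBMDX2"
        elif minutes == "45" and hour in ["07", "08", "09", "10", "12", "13", "14"]:
            df.loc[i, "resource"] = "ZBMDX1"
        elif minutes == "30" and hour in ["08", "09", "10", "13", "14"]:
            df.loc[i, "resource"] = "ZBMDX4"
    return df

df2 = diagnostic(df)
print(df2["resource"].tolist())
```

Looping over df.index rather than a fixed range(10) keeps the function from silently skipping rows or failing on shorter frames.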

Solution 2

Another way to do it is to make a function that makes changes to each row based on some logic, and then apply this to your DataFrame. Something like the following:

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    cond1, cond2, cond3, cond4 = True, False, False, False
    if cond1:
        row['resource'] = 'ZBMDX3'
    elif cond2:
        row['resource'] = 'ZBMDX2'
    elif cond3:
        row['resource'] = 'ZBMDX1'
    elif cond4:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

where obviously you would update your conditional logic where I have put in the dummy conditions cond1 etc.

I prefer solution 2, as it is cleaner and easier to keep track of changes. It is also in general more performant (though I haven't verified in this particular case).

I don't have enough points to comment on your question. So I'll just post a modified version of your code that should work:

import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escapes
my_file = r'C:\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')  # 'sheetname' in older pandas versions

def update_val(row):
    def time_range(start, stop):
        # Zero-padded hour strings from start to stop inclusive: '08', '09', ...
        # Note: a contiguous range also includes hours your criteria skip
        # (such as 12 for the :00 rule)
        return [str(el).zfill(2) for el in range(start, stop + 1)]

    minutes = str(row['start appointment'])[14:16]  # [10:12] in sample data
    hour = str(row['start appointment'])[11:13]  # [8:10] in sample data
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in time_range(8, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in time_range(9, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in time_range(7, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in time_range(8, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

I made two changes.

1) placed the sub-conditions in parentheses. I believe they were formatted incorrectly in your original formulation and so they never evaluated to True .

2) Changed the indexing of the start appointment row. Based on your sample data the original indexing was returning an empty str, and therefore never evaluating to any of the options.
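The two sets of slice positions can be compared directly. The sketch below assumes the column holds either an integer in YYYYMMDDhhmmss form (as the sample data suggests) or a pandas Timestamp; which one you have depends on how read_excel parsed the sheet.

```python
import pandas as pd

as_integer = str(20171020131500)                          # '20171020131500'
as_timestamp = str(pd.Timestamp("2017-10-20 13:15:00"))   # '2017-10-20 13:15:00'

# Positions [11:13] and [14:16] only make sense for the Timestamp text.
print(repr(as_integer[11:13]), repr(as_integer[14:16]))       # '50' ''
print(repr(as_timestamp[11:13]), repr(as_timestamp[14:16]))   # '13' '15'

# Positions [8:10] and [10:12] only make sense for the integer text.
print(repr(as_integer[8:10]), repr(as_integer[10:12]))        # '13' '15'
```

Printing repr(str(df['start appointment'][0])) on the real data is a quick way to see which form applies.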

P.S. You can just print the first 5 rows to the console to check whether the values updated, rather than writing to disk each time.

OK, I've looked at the sample data now and found the issue. There was trailing whitespace in the resource column, causing the logic to fail. This can be removed simply by using str.strip(). Also, the start appointment field is being parsed as a pandas.tslib.Timestamp object, which simplifies our logic because the minute and hour can be extracted directly as ints. The following should work:

def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row      

df2 = df.apply(update_val, axis='columns')
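As an alternative to the row-by-row apply, the same rules can be written with boolean masks over whole columns. This is a sketch rather than a drop-in replacement: the four-row frame is built by hand from the sample rows, 'MAMRM1' is a made-up resource added to show a row that should stay unchanged, and the rules list is just one hypothetical way to organise the criteria.

```python
import pandas as pd

df = pd.DataFrame({
    "start appointment": pd.to_datetime(
        ["20171020131500", "20171020133000", "20171020144500", "20171023080000"],
        format="%Y%m%d%H%M%S",
    ),
    # Trailing spaces mimic the whitespace found in the real data.
    "resource": ["BDIAG2    ", "BDIAG1    ", "BDIAG2    ", "MAMRM1"],
})

resource = df["resource"].str.strip()
hour = df["start appointment"].dt.hour
minute = df["start appointment"].dt.minute
is_diag = resource.isin(["BDIAG1", "BDIAG2", "BDIAG3"])

# (minute, allowed hours, new resource), taken from the criteria in the question
rules = [
    (0, [8, 9, 10, 11, 13, 14, 15], "ZBMDX2"),
    (15, [8, 9, 10, 11, 13, 14, 15], "ZBMDX3"),
    (45, [7, 8, 9, 10, 12, 13, 14], "ZBMDX1"),
    (30, [8, 9, 10, 13, 14], "ZBMDX4"),
]

df2 = df.assign(resource=resource)   # copy of df with the stripped column
for rule_minute, rule_hours, new_resource in rules:
    mask = is_diag & (minute == rule_minute) & hour.isin(rule_hours)
    df2.loc[mask, "resource"] = new_resource

print(df2["resource"].tolist())
# To export (needs an Excel engine such as openpyxl or xlsxwriter installed):
# df2.to_excel("Python_Output.xlsx", sheet_name="Sheet1", index=False)
```

The masks are computed from the original columns, so one rule cannot affect the input of another. On the export side, recent pandas versions no longer have writer.save(); calling df2.to_excel(path) directly, or using pd.ExcelWriter in a with block, avoids it.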
