How can I update my DataFrame in Pandas and export out to Excel?

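I'm new to Python and to programming. Please forgive me if my question seems silly or unclear. I have done some research, but frankly, some of the explanations I've read are hard to follow.

I have a DataFrame containing a large amount of scheduled appointment data for a hospital. It needs to be evaluated and modified so that it can be imported into a new scheduling application. Unfortunately, the vendor's import tool is garbage and does zero checking, so I have to write something that checks the old data and converts it into upload data for the new system. Here is a sample of the format: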

start appointment   department  procedure   resource
20171020131500      MAM         BDXMAMUNI   BDIAG2    
20171020133000      MAM         BDXMAMUNI   BDIAG1    
20171020141500      MAM         BDXMAMUNI   BDIAG2    
20171020143000      MAM         BDXMAMUNI   BDIAG1    
20171020144500      MAM         BDXMAMBIL   BDIAG2    
20171020150000      MAM         BDXMAMBIL   BDIAG1    
20171020151500      MAM         BDXMAMUNI   BDIAG2    
20171023080000      MAM         BDXMAMBIL   BDIAG1    
20171023081500      MAM         BDXMAMBIL   BDIAG2       

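I am trying to update the data based on conditions. Here is what I came up with, but I cannot get it to update. These are my criteria:

If the start appointment at index X has minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and resource = BDIAG1, BDIAG2 or BDIAG3, then the start appointment at index X will be in resource ZBMDX3 at index X.

If the start appointment at index X has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then the start appointment at index X will be in resource ZBMDX2 at index X.

If the start appointment at index X has minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX1 at index X.

If the start appointment at index X has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX4 at index X.

When I create the output file, it does not contain any of the updated changes. I did some research on StackOverflow, but nothing from the threads I read seems to work. Some suggested doing something with loc, ix, and df.update.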

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')

dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']


def diagnostic():  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15') \
                and (resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'

diagnostic()

# Specify a writer
writer = pd.ExcelWriter(r'C:\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

I made the suggested changes.

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

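Now I get an error.

Traceback (most recent call last):
  File "Excel Parse.py", line 55, in <module>
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in: >
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.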

Seiji, I have fully updated my code to reflect your changes. Let's look at Solution 2, since it is faster to process.

import pandas as pd

my_file = r'C:\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

# Specify a writer
writer = pd.ExcelWriter(r'C:\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

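After creating the output file, I still do not see any updates to the resource field. I manually checked the first 10 rows to rule out the possibility that the criteria simply were not being met; the criteria are met there.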

start appointment dept      procedure   resource
20171020131500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020133000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020141500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020143000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020144500    MAM       BDXMAMBIL   BDIAG2    should change to ZBMDX1

Seiji's Solution 1

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']

def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter(r'C:\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

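Same problem. The output file was not updated.

Modified the slicing for hours and minutes

The output still shows no updates. At this point I am wondering whether I should save the xlsx file as CSV and not use any libraries, or whether I should build the DataFrame from scratch by looping each column (start appointment, resource) into its own list. What do you think?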

import pandas as pd

my_file = 'C:\\Users\cboutsikos\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14', '15']) \
            and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and  (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

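Okay..

Your function diagnostic changes the global df, but it neither accepts a DataFrame nor returns anything. So when you call it with df2 = diagnostic(df), you are not feeding df into it, and you are not getting a modified DataFrame back but a NoneType. That is why you get the error telling you that df2 is not a pd.DataFrame object and therefore has no attribute 'to_excel'.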
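To make the None behaviour concrete, here is a minimal sketch. The two function names and the tiny two-row frame are made up for illustration and are not taken from the question's workbook.

```python
import pandas as pd

# Hypothetical two-row frame, just enough to show the effect.
df = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2"]})

def modifies_global():
    # Changes the global df in place but has no return statement,
    # so calling it evaluates to None.
    df.loc[0, "resource"] = "ZBMDX3"

def modifies_and_returns(frame):
    # Works on the frame that is passed in and hands it back.
    frame.loc[1, "resource"] = "ZBMDX4"
    return frame

result_none = modifies_global()
result_df = modifies_and_returns(df)

print(result_none is None)
print(type(result_df).__name__)
```

Calling result_none.to_excel(...) would raise the same AttributeError as in the question, because result_none is None.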

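It would be better if your function accepted df as input, made its changes, and then returned the modified df as output.

You only need two modifications:

1) Include df as a parameter in the first line: def diagnostic(df):

2) Include return df as the last line.

Something like: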

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
      ...
      ...
            df.loc[i, 'resource'] = 'ZBMDX4' # see explanation below.
    return df

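Another problem is that you should probably be using df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or, per the docs, objects that can be coerced into a DataFrame), whereas you want to update one value at a time.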
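A small sketch of the difference, using a made-up three-row frame (the variable name patch is hypothetical). Note that df.update is a method, so subscripting it as df.update['resource'] raises a TypeError as soon as that line is actually reached.

```python
import pandas as pd

df = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2", "BDIAG1"]})

# Single-cell assignment: row label 0, column 'resource'.
df.loc[0, "resource"] = "ZBMDX3"

# DataFrame.update takes another DataFrame, aligns it on index and
# columns, and overwrites matching cells in place with its non-NA values.
patch = pd.DataFrame({"resource": ["ZBMDX4"]}, index=[2])
df.update(patch)

# Subscripting the method itself is an error.
error = None
try:
    df.update["resource"]
except TypeError as exc:
    error = type(exc).__name__

print(df["resource"].tolist())
print(error)
```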

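Another point is that your conditions can be simplified. Instead of writing hour == x1 or hour == x2 or ..., you can put the possible values in a list and check for membership, as in hour in [x1, x2, ...].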
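For instance, a quick sketch with an arbitrary hour value (the name allowed_hours is only illustrative) shows that both forms give the same result:

```python
hour = "13"
allowed_hours = ["08", "09", "10", "11", "13", "14", "15"]

# Long form: one comparison per allowed value.
long_form = (hour == "08" or hour == "09" or hour == "10" or hour == "11"
             or hour == "13" or hour == "14" or hour == "15")

# Short form: a single membership test against the list.
short_form = hour in allowed_hours

print(long_form, short_form)
```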

Since there is a lot to unpack here, I wrote a skeleton of what I mean:

Solution 1

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[10:12]
        hour = str(start_appointment[i])[8:10]
        # condition_1 .. condition_4 are placeholders for your own logic
        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

Each condition is your own logic (something like condition_1 = (minutes == '15') and hour in ['09', '10', '11']), and so on.
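Filled in, the skeleton might look roughly like the sketch below. It assumes the start appointment values are integers shaped like the question's sample (YYYYMMDDHHMMSS), so the hour is at [8:10] and the minutes at [10:12]; if pandas has parsed the column into Timestamps, the slices differ. The frame is made up from sample-like values, DIAG_RESOURCES is a hypothetical helper name, and the resource check is applied to every rule as in the question's code.

```python
import pandas as pd

# Hypothetical data in the same shape as the question's sample.
df = pd.DataFrame({
    "start appointment": [20171020131500, 20171020133000,
                          20171020144500, 20171020120000],
    "resource": ["BDIAG2", "BDIAG1", "BDIAG2", "BDIAG1"],
})

DIAG_RESOURCES = ["BDIAG1", "BDIAG2", "BDIAG3"]

def diagnostic(df):
    # Iterate over the actual index instead of a hard-coded range.
    for i in df.index:
        stamp = str(df.loc[i, "start appointment"])
        hour, minutes = stamp[8:10], stamp[10:12]
        in_diag = df.loc[i, "resource"] in DIAG_RESOURCES
        if minutes == "15" and hour in ["08", "09", "10", "11", "13", "14", "15"] and in_diag:
            df.loc[i, "resource"] = "ZBMDX3"
        elif minutes == "00" and hour in ["08", "09", "10", "11", "13", "14", "15"] and in_diag:
            df.loc[i, "resource"] = "ZBMDX2"
        elif minutes == "45" and hour in ["07", "08", "09", "10", "12", "13", "14"] and in_diag:
            df.loc[i, "resource"] = "ZBMDX1"
        elif minutes == "30" and hour in ["08", "09", "10", "13", "14"] and in_diag:
            df.loc[i, "resource"] = "ZBMDX4"
    return df

df2 = diagnostic(df)
print(df2)
```

The last row (12:00) matches no rule and keeps its original resource.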

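Solution 2

Another way to do this is to create a function that changes each row according to some logic, and then apply it to your DataFrame. Something like the following: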

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    cond1, cond2, cond3, cond4 = True, False, False, False
    if cond1:
        row['resource'] = 'ZBMDX3'
    elif cond2:
        row['resource'] = 'ZBMDX2'
    elif cond3:
        row['resource'] = 'ZBMDX1'
    elif cond4:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

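Obviously, you would replace the dummy conditions cond1 and so on with your own conditional logic.

I prefer Solution 2 because it is cleaner and makes the changes easier to track. It is usually also more performant (though I have not verified that in this case).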
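If performance matters, a third option that is neither of the two solutions is to skip the row-by-row work and use boolean masks with df.loc. The sketch below is only an illustration under stated assumptions: the start appointment column is already a datetime column, resource may carry trailing spaces, and the rules list and the sample frame are made up for the example.

```python
import pandas as pd

# Hypothetical frame; timestamps are parsed from sample-like strings.
df = pd.DataFrame({
    "start appointment": pd.to_datetime(
        ["20171020131500", "20171020133000",
         "20171020144500", "20171023080000"],
        format="%Y%m%d%H%M%S"),
    "resource": ["BDIAG2    ", "BDIAG1", "BDIAG2", "BDIAG1"],
})

hour = df["start appointment"].dt.hour
minute = df["start appointment"].dt.minute
resource = df["resource"].str.strip()
is_diag = resource.isin(["BDIAG1", "BDIAG2", "BDIAG3"])

# (minute, allowed hours, new resource); the rules are mutually
# exclusive by minute, so their order does not matter.
rules = [
    (0, [8, 9, 10, 11, 13, 14, 15], "ZBMDX2"),
    (15, [8, 9, 10, 11, 13, 14, 15], "ZBMDX3"),
    (45, [7, 8, 9, 10, 12, 13, 14], "ZBMDX1"),
    (30, [8, 9, 10, 13, 14], "ZBMDX4"),
]

df["resource"] = resource  # keep the stripped values
for rule_minute, rule_hours, new_resource in rules:
    mask = (minute == rule_minute) & hour.isin(rule_hours) & is_diag
    df.loc[mask, "resource"] = new_resource

print(df)
```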

I don't have enough reputation to comment on your question, so I'll post a modified version of your code that should work:

import pandas as pd

my_file = r'C:\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')

def update_val(row):
    def time_range(start, stop):
        # Zero-padded hour strings from start to stop inclusive, e.g. '08', '09'.
        # Note: a contiguous range also covers hours (such as 12) that the
        # question's hour lists leave out.
        return [str(el).zfill(2) for el in range(start, stop + 1)]

    minutes = str(row['start appointment'])[14:16]  # [10:12] in sample data
    hour = str(row['start appointment'])[11:13]  # [8:10] in sample data
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in time_range(8, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in time_range(9, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in time_range(7, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in time_range(8, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

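I made two changes.

1) Put the sub-conditions in parentheses. I believe they were not formed correctly in your original, so they never evaluated to True.

2) Changed the slice indices for the start appointment column. Based on your sample data, the original indices return an empty str, so none of the options were ever evaluated.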
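Both points can be checked in isolation. The sketch below is an illustration, not a diagnosis of the actual workbook: it shows one likely reason an unparenthesized `x in lst == True` never evaluates to True (Python chains the two comparisons), and how the slice positions depend on whether the value is an integer like the sample data or a parsed Timestamp.

```python
import pandas as pd

resource = "BDIAG1"
diag = ["BDIAG1", "BDIAG2", "BDIAG3"]

# Chained comparison: evaluated as (resource in diag) and (diag == True),
# and a list never equals True, so the whole expression is False.
unparenthesized = resource in diag == True
parenthesized = (resource in diag) == True

# The same appointment as a 14-digit integer and as a Timestamp.
as_integer = str(20171020131500)
as_timestamp = str(pd.Timestamp("2017-10-20 13:15:00"))

print(unparenthesized, parenthesized)
print(repr(as_integer[14:16]), as_integer[8:10], as_integer[10:12])
print(as_timestamp[11:13], as_timestamp[14:16])
```

Simply writing `resource in diag` without the `== True` avoids the chaining issue altogether.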

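P.S. Rather than writing to disk every time, you can just print the first five rows to the console to check whether the values were updated.

OK, I have now looked at the sample data and found the problem. The resource column has trailing whitespace, which causes the logic to fail. It can be removed simply with str.strip(). Also, the start appointment field is parsed as a pandas.tslib.Timestamp object, which can extract the minute and hour as int, simplifying our logic. The following should work: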

def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row      

df2 = df.apply(update_val, axis='columns')
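Both issues are easy to spot before writing any conditions. A minimal sketch, using a made-up two-row frame that mimics what read_excel produced here (the variable names are only illustrative):

```python
import pandas as pd

# Hypothetical frame: parsed timestamps and padded resource strings.
df = pd.DataFrame({
    "start appointment": pd.to_datetime(
        ["2017-10-20 13:15:00", "2017-10-20 13:30:00"]),
    "resource": ["BDIAG2    ", "BDIAG1    "],
})

# repr() makes trailing spaces visible, which print(df) hides.
raw_values = df["resource"].map(repr).tolist()

# Check whether the column is a datetime column before slicing strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(df["start appointment"])

# Clean the whole column once instead of stripping inside every row call.
df["resource"] = df["resource"].str.strip()
hours = df["start appointment"].dt.hour.tolist()

print(raw_values)
print(is_datetime, hours)
```

For the export step, df2.to_excel(path, sheet_name='Sheet1') is enough on its own. In recent pandas versions (2.0 and later, as far as I know) ExcelWriter.save() is no longer available, so if you need a writer object, use it as a context manager (with pd.ExcelWriter(path) as writer:) and let it close the file for you.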
