How can I update my DataFrame in Pandas and export it to Excel?

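I am new to Python and to programming. Please forgive me if my question seems silly or unclear. I have done some research, but frankly, some of the explanations I have read are hard to follow.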

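I have a DataFrame containing a large amount of scheduled-appointment data for a hospital. It needs to be evaluated and modified so that it can be imported into a new scheduling application. Unfortunately, the vendor's import tool is poor and performs zero checks, so I have to write something that checks the old data and converts it into upload data for the new system. Here is a sample of the format: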

start appointment   department  procedure   resource
20171020131500      MAM         BDXMAMUNI   BDIAG2    
20171020133000      MAM         BDXMAMUNI   BDIAG1    
20171020141500      MAM         BDXMAMUNI   BDIAG2    
20171020143000      MAM         BDXMAMUNI   BDIAG1    
20171020144500      MAM         BDXMAMBIL   BDIAG2    
20171020150000      MAM         BDXMAMBIL   BDIAG1    
20171020151500      MAM         BDXMAMUNI   BDIAG2    
20171023080000      MAM         BDXMAMBIL   BDIAG1    
20171023081500      MAM         BDXMAMBIL   BDIAG2       

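I am trying to update the resource based on conditions. This is what I came up with, but I cannot get it to update. These are my criteria.

If the start appointment at index X has minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and the resource is BDIAG1, BDIAG2 or BDIAG3, then the start appointment at index X goes to resource ZBMDX3 at index X.

If the start appointment at index X has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then the start appointment at index X goes to resource ZBMDX2 at index X.

If the start appointment at index X has minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then the start appointment at index X goes to resource ZBMDX1 at index X.

If the start appointment at index X has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then the start appointment at index X goes to resource ZBMDX4 at index X.

When the output file is created, it contains none of the updated changes. I did some research on StackOverflow, but none of the threads I read seem to work. Some suggested doing something with loc, ix, and df.update.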

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')

dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']


def diagnostic():  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (
                hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (
                hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'

diagnostic()

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\\user_name\\Desktop\\Python 3\\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

I made the suggested changes.

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\\cboutsikos\\Desktop\\Python 3\\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

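Now I get an error.

Traceback (most recent call last):
  File "Excel Parse.py", line 55, in <module>
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in: >
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.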

Seiji, I have fully updated my code to reflect your changes. Let's look at Solution 2, since it is faster to process.

import pandas as pd

my_file = 'C:\\Users\\user_name\\Desktop\\Python 3\\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] \
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\\user_name\\Desktop\\Python 3\\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

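After the output file is created, I still do not see any updates to the resource field. I manually checked the first 10 rows in case the code was running but no row met the criteria; rows that meet the criteria do exist.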

start appointment dept      procedure   resource
20171020131500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020133000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020141500    MAM       BDXMAMUNI   BDIAG2    should change to ZBMDX3
20171020143000    MAM       BDXMAMUNI   BDIAG1    should change to ZBMDX4
20171020144500    MAM       BDXMAMBIL   BDIAG2    should change to ZBMDX1

Seiji's Solution 1

import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']

def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and hour in ['9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15'] and resource[i] in ['BDIAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7', '8', '9', '10', '12', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8', '9', '10', '13', '14'] and resource[i] in ['BIDAG1', 'BDIAG2', 'BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

# Specify a writer
writer = pd.ExcelWriter('C:\\Users\\cboutsikos\\Desktop\\Python 3\\Python_Output.xlsx', engine='xlsxwriter')

# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')

# Save the result
writer.save()

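Same problem. The output file is not updated.

Modified the slicing for hour and minutes

It still does not show the updates in the output. At this point I am wondering whether I should save the xlsx file as a CSV and not use any libraries, or whether I should build the DataFrame from scratch by iterating each column (start appointment, resource) into its own list. What do you think?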

import pandas as pd

my_file = 'C:\\Users\\cboutsikos\\Desktop\\Python 3\\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14', '15']) \
         and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and  (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

OK..

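Your function diagnostic changes the global df, but it neither accepts a DataFrame nor returns anything. So when you call it with df2 = diagnostic(df), you are not feeding df into it, and what you get back is NoneType rather than a modified DataFrame. That is why you get the error telling you that df2 is not a pd.DataFrame object and therefore has no attribute 'to_excel'.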
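To make the None result concrete, here is a minimal sketch. The two-row DataFrame and the function names no_return and with_return are made up for illustration; only the pattern matters.

```python
import pandas as pd

# Hypothetical two-row frame, standing in for the real appointment data.
df = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2"]})


def no_return():
    # Mutates the global df, but has no return statement,
    # so calling it evaluates to None.
    df.loc[0, "resource"] = "ZBMDX4"


def with_return(frame):
    # Takes the frame as a parameter, changes it, and hands it back.
    frame.loc[1, "resource"] = "ZBMDX3"
    return frame


out1 = no_return()
out2 = with_return(df)

print(type(out1))                 # <class 'NoneType'>
print(out2["resource"].tolist())  # ['ZBMDX4', 'ZBMDX3']
```

Note that the global df was still changed by no_return; it is only the value assigned from the call that is None, which is what later fails on .to_excel.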

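It would be better if your function accepted df as input, made its changes, and then returned the modified df as output.

You only need to make two modifications:

1) Include df as a parameter in the first line: def diagnostic(df):

2) Include return df as the last line.

Something like: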

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
      ...
      ...
            df.loc[i, 'resource'] = 'ZBMDX4' # see explanation below.
    return df

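Another issue is that you should probably use df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or, per the docs, objects that can be coerced into DataFrames), whereas you want to update one value at a time.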
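As a rough illustration of the difference, the sketch below uses a made-up three-row frame and a made-up patch frame: df.loc assigns a single cell by row label and column name, while df.update is a method that takes another DataFrame and overwrites the matching cells in place.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"resource": ["BDIAG1", "BDIAG2", "BDIAG1"]})

# Single cell: row label 1, column 'resource'.
df.loc[1, "resource"] = "ZBMDX3"

# df.update is called with another DataFrame; cells are matched
# on index and column labels and overwritten in place.
patch = pd.DataFrame({"resource": ["ZBMDX4"]}, index=[2])
df.update(patch)

print(df["resource"].tolist())  # ['BDIAG1', 'ZBMDX3', 'ZBMDX4']
```

Because update is a method, an expression like df.update['resource'] cannot work: a method object is not subscriptable.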

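Another issue is that your conditions can be simplified. Instead of writing hour == x1 or hour == x2 or ..., you can put the possible values in a list and check membership, like hour in [x1, x2, ...].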
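A small sketch of that simplification, with an example hour value chosen arbitrarily; both expressions give the same result, the second is just easier to read and to extend.

```python
hour = "10"  # example value, assumed for illustration

# Chained equality tests joined with `or`.
long_form = hour == "08" or hour == "09" or hour == "10" or hour == "11"

# The same test written as a membership check.
short_form = hour in ["08", "09", "10", "11"]

print(long_form, short_form)  # True True
```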

Since there is a lot to unpack here, I wrote a skeleton of what I mean:

Solution 1

def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[10:12]
        hour = str(start_appointment[i])[8:10]
        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df

df2 = diagnostic(df)

Here each condition holds your logic, something like condition_1 = (minutes == '15') and hour in ['09', '10', '11'], and so on.

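Solution 2

Another way to do this is to create a function that changes each row according to some logic, and then apply it to your DataFrame. Something like the following: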

def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    cond1, cond2, cond3, cond4 = True, False, False, False
    if cond1:
        row['resource'] = 'ZBMDX3'
    elif cond2:
        row['resource'] = 'ZBMDX2'
    elif cond3:
        row['resource'] = 'ZBMDX1'
    elif cond4:
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')

Obviously, you would replace the dummy conditions cond1 etc. with your actual condition logic.

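I prefer Solution 2 because it is cleaner and makes the changes easier to track. It is usually also more performant (though I have not verified that in this case).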
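If performance matters, a third option that is not part of either solution above is to avoid per-row Python code and use boolean masks with df.loc. The sketch below is only an illustration under assumptions: the four sample rows are made up in the question's format, the rules table is a hypothetical encoding of the asker's four criteria, and the timestamps are parsed with pd.to_datetime.

```python
import pandas as pd

# Hypothetical rows in the question's format.
df = pd.DataFrame({
    "start appointment": [20171020131500, 20171020133000,
                          20171020144500, 20171020120000],
    "resource": ["BDIAG2", "BDIAG1", "BDIAG2", "BDIAG1"],
})

# Parse YYYYMMDDhhmmss into timestamps, then pull out hour and minute.
ts = pd.to_datetime(df["start appointment"].astype(str), format="%Y%m%d%H%M%S")
hour, minute = ts.dt.hour, ts.dt.minute

# Rows eligible for reassignment, computed once before any update.
is_diag = df["resource"].str.strip().isin(["BDIAG1", "BDIAG2", "BDIAG3"])

# (minute, allowed hours, new resource), taken from the question's criteria.
rules = [
    (0, [8, 9, 10, 11, 13, 14, 15], "ZBMDX2"),
    (15, [8, 9, 10, 11, 13, 14, 15], "ZBMDX3"),
    (45, [7, 8, 9, 10, 12, 13, 14], "ZBMDX1"),
    (30, [8, 9, 10, 13, 14], "ZBMDX4"),
]

for rule_minute, rule_hours, new_resource in rules:
    mask = is_diag & (minute == rule_minute) & hour.isin(rule_hours)
    df.loc[mask, "resource"] = new_resource

print(df["resource"].tolist())
# ['ZBMDX3', 'ZBMDX4', 'ZBMDX1', 'BDIAG1']
```

The last row stays BDIAG1 because 12:00 does not match any rule. Whether this is actually faster than apply on the real file would have to be measured.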

I do not have enough reputation points to comment on your question, so I am posting a modified version of your code that should work:

import pandas as pd

my_file = 'C:\\Users\\user_name\\Desktop\\Python 3\\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')

def update_val(row):
    def time_range(start,stop):
        return [str(el).zfill(2) for el in range(start,stop+1)]

    minutes = str(row['start appointment'])[14:16] # [10:12] in sample data
    hour = str(row['start appointment'])[11:13] # [8:10] in sample data
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in time_range(8,15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in time_range(9,15)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in time_range(7,14)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in time_range(8,14)) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())

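I made two changes.

1) Put the sub-conditions in parentheses. I believe they were not formed correctly in your original, so they never evaluated to True.

2) Changed the slice indices for the start appointment field. With your sample data, the original indices return an empty str, so none of the options was ever evaluated.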
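Both points can be checked in isolation. In the sketch below the variable names are made up, and the sample value is the first row of the question's data. The first part shows why an unparenthesized `x in some_list == True` is always False: Python chains comparison operators, so it is read as `(x in some_list) and (some_list == True)`. The second part shows what each slice returns for a 14-digit value versus the string form of a Timestamp.

```python
import pandas as pd

# 1) Chained comparison pitfall.
resource = "BDIAG1"
allowed = ["BDIAG1", "BDIAG2", "BDIAG3"]
unparenthesized = resource in allowed == True  # (resource in allowed) and (allowed == True)
parenthesized = (resource in allowed) == True
print(unparenthesized, parenthesized)  # False True

# 2) Slice positions depend on how the value is rendered as a string.
raw = str(20171020131500)                     # 14 characters, no separators
stamp = str(pd.Timestamp("2017-10-20 13:15:00"))  # '2017-10-20 13:15:00'

raw_minutes_wrong = raw[14:16]   # '' : past the end of the string
raw_hour_ok = raw[8:10]          # '13'
raw_minutes_ok = raw[10:12]      # '15'
stamp_hour = stamp[11:13]        # '13'
stamp_minutes = stamp[14:16]     # '15'
print(repr(raw_minutes_wrong), raw_hour_ok, raw_minutes_ok, stamp_hour, stamp_minutes)
```

Note also that a slice yields zero-padded text such as '08', which never equals '8'; that is why the answer's time_range helper pads with zfill(2).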

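P.S. Instead of writing to disk every time, you can just print the first five rows to the console to check whether the values were updated.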

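OK, I have now looked at the sample data and found the problem. The resource column has trailing whitespace, which causes the logic to fail. It can be removed simply with str.strip(). Also, the start appointment field is parsed as a pandas.tslib.Timestamp object, which lets us extract the minute and hour attributes as int, simplifying our logic. The following should work: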

def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row      

df2 = df.apply(update_val, axis='columns')
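The two observations behind this version, trailing whitespace and Timestamp attributes, can be verified on their own. The values below are assumptions modelled on the sample rows, not read from the real file.

```python
import pandas as pd

allowed = ["BDIAG1", "BDIAG2", "BDIAG3"]

# A value with trailing spaces never matches the list until it is stripped.
raw_resource = "BDIAG2    "
padded_match = raw_resource in allowed            # False
stripped_match = raw_resource.strip() in allowed  # True

# A Timestamp exposes hour and minute as plain integers.
stamp = pd.Timestamp("2017-10-20 13:15:00")
hour, minute = stamp.hour, stamp.minute           # 13, 15

# The same cleanup can be applied to a whole column at once.
df = pd.DataFrame({"resource": ["BDIAG2    ", "BDIAG1 "]})
df["resource"] = df["resource"].str.strip()

print(padded_match, stripped_match, hour, minute, df["resource"].tolist())
```

Stripping the column once up front also means the rows that do not match any condition are written back to Excel without the stray spaces.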
