How can I update my DataFrame in Pandas and export out to Excel?
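I am new to Python and to programming. Please forgive me if my question seems silly or unclear. I have done my research, but frankly some of the explanations I have read are hard to follow.
I have a DataFrame containing a large number of scheduled appointments for a hospital. They need to be evaluated and modified so that they can be imported into a new scheduling application. Unfortunately, the vendor's import tool is rubbish and does zero checking, so I have to write something that checks the old data and converts it into upload data for the new system. Here is a sample of the format: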
start appointment department procedure resource
20171020131500 MAM BDXMAMUNI BDIAG2
20171020133000 MAM BDXMAMUNI BDIAG1
20171020141500 MAM BDXMAMUNI BDIAG2
20171020143000 MAM BDXMAMUNI BDIAG1
20171020144500 MAM BDXMAMBIL BDIAG2
20171020150000 MAM BDXMAMBIL BDIAG1
20171020151500 MAM BDXMAMUNI BDIAG2
20171023080000 MAM BDXMAMBIL BDIAG1
20171023081500 MAM BDXMAMBIL BDIAG2
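I am trying to update the data based on conditions. Here is what I came up with, but I cannot get it to update. These are my individual criteria.
If start appointment at index X has minutes = 15 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15) and resource = BDIAG1, BDIAG2 or BDIAG3, then the start appointment at index X will be in resource ZBMDX3 at index X
If start appointment at index X has minutes = 00 and (hr = 8 or hr = 9 or hr = 10 or hr = 11 or hr = 13 or hr = 14 or hr = 15), then the start appointment at index X will be in resource ZBMDX2 at index X
If start appointment at index X has minutes = 45 and (hr = 7 or hr = 8 or hr = 9 or hr = 10 or hr = 12 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX1 at index X
If start appointment at index X has minutes = 30 and (hr = 8 or hr = 9 or hr = 10 or hr = 13 or hr = 14), then the start appointment at index X will be in resource ZBMDX4 at index X
When the output file is created, it has none of the updated changes. I did some research on StackOverflow, but none of the threads I read seemed to work. Some suggested doing something with loc, ix and df.update.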
import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet1')
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']
def diagnostic():  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and (
                hour == '8' or hour == '9' or hour == '10' or hour == '11'
                or hour == '13' or hour == '14' or hour == '15') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG3'):
            df.update['resource'][i] = 'ZBMDX3'
        elif minutes == '00' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '11' or hour == '13' or hour == '14' or hour == '15')
                and (resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG2'):
            df.update['resource'][i] = 'ZBMDX2'
        elif minutes == '45' and (
                hour == '7' or hour == '8' or hour == '9' or hour == '10' or
                hour == '12' or hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX1'
        elif minutes == '30' and (hour == '8' or hour == '9' or hour == '10' or
                hour == '13' or hour == '14') and (
                resource[i] == 'BIDAG1' or resource[i] == 'BDIAG2' or
                resource[i] == 'BDIAG1'):
            df.update['resource'][i] = 'ZBMDX4'
diagnostic()
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
df2 = diagnostic(df)
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
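Now I am getting an error.
Traceback (most recent call last):
  File "Excel Parse.py", line 55
    df2.to_excel(writer, 'Sheet1')
AttributeError: 'NoneType' object has no attribute 'to_excel'
Exception ignored in:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 153, in __del__
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.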
import pandas as pd
my_file = 'C:\\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')
def update_val(row):
    minutes = str(row['start appointment'])[14:16]
    hour = str(row['start appointment'])[11:13]
    resource = row['resource']
    # cond1, cond2, cond3, cond4 = True, False, False, False
    # Condition 1
    if minutes == '00' and hour in ['8', '9', '10', '11', '13', '14', '15']
            and resource in ['BDIAG1', 'BDIAG2', 'BDIAG3'] == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif minutes == '15' and hour in ['9', '10','11','13','14','15']
            and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif minutes == '45' and hour in ['7','8','9','10','12','13','14']
            and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif minutes == '30' and hour in ['8','9','10','13','14']
            and resource in ['BDIAG1','BDIAG2','BDIAG3'] == True:
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\user_name\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
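After creating the output file, I still do not see the updates in the resource field. I manually checked the first 10 rows to rule out the possibility that the criteria are simply never met: rows that meet the criteria are there.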
start appointment dept procedure resource
20171020131500 MAM BDXMAMUNI BDIAG2 should change to ZBMDX3
20171020133000 MAM BDXMAMUNI BDIAG1 should change to ZBMDX4
20171020141500 MAM BDXMAMUNI BDIAG2 should change to ZBMDX3
20171020143000 MAM BDXMAMUNI BDIAG1 should change to ZBMDX4
20171020144500 MAM BDXMAMBIL BDIAG2 should change to ZBMDX1
import pandas as pd
df = pd.read_excel(my_file, sheet_name='Sheet3')
# Pull Columns as a Variable
dept = df['department']
resource = df['resource']
start_appointment = df['start appointment']
def diagnostic(df):
    for i in range(1,100):
        minutes = str(start_appointment[i])[14:16]
        hour = str(start_appointment[i])[11:13]
        if minutes == '15' and hour in ['9', '10','11','13','14','15'] and resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif minutes == '00' and hour in ['8','9','10','11','13','14','15'] and resource[i] in ['BDIAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif minutes == '45' and hour in ['7','8','9','10','12','13','14'] and resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif minutes == '30' and hour in ['8','9','10','13','14'] and resource[i] in ['BIDAG1','BDIAG2','BDIAG3']:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df
df2 = diagnostic(df)
# Specify a writer
writer = pd.ExcelWriter('C:\\Users\cboutsikos\Desktop\Python 3\Python_Output.xlsx', engine='xlsxwriter')
# Write your DataFrame to a file
df2.to_excel(writer, 'Sheet1')
# Save the result
writer.save()
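Same problem. The output file is not updated.
Still no updates showing in the output. At this point, I am wondering whether I should save the xlsx file as a CSV and not use any libraries, or whether I should build the DataFrame from scratch by iterating each column (start appointment, resource) into its own list. What do you think?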
import pandas as pd
my_file = 'C:\\Users\cboutsikos\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheetname='Sheet3')
def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    # Condition 1
    if (minutes == '00') and (hour in ['8', '9', '10', '11', '13', '14', '15']) \
            and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']) == True:
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in ['9', '10','11','13','14','15']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in ['7','8','9','10','12','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in ['8','9','10','13','14']) \
            and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
print(df2.head())
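OK.
Your function diagnostic modifies the global df, but it neither accepts a DataFrame nor returns anything. So when you call it with df2 = diagnostic(df), you are not feeding df into it, and what comes back is None (a NoneType) rather than the modified DataFrame. That is why you get the error telling you that df2 is not a pd.DataFrame object and therefore has no attribute "to_excel".
It would be better if your function accepted df as input, made changes to it, and then returned the modified df as output.
You only need two modifications:
1) Include df as a parameter in the first line: def diagnostic(df):
2) Include return df as the last line.
Something like: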
def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        ...
        ...
        df.loc[i, 'resource'] = 'ZBMDX4'  # see explanation below.
    return df
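To see why the original call ended in an AttributeError, here is a minimal sketch. The two-row DataFrame and the function names no_return and with_return are made up for illustration; the point is only that a Python function without a return statement hands back None, even when it has modified the DataFrame in place.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the appointment data
df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2']})

def no_return(frame):
    # Modifies the frame in place but has no return statement
    frame.loc[0, 'resource'] = 'ZBMDX4'

def with_return(frame):
    # Same modification, but hands the frame back to the caller
    frame.loc[0, 'resource'] = 'ZBMDX4'
    return frame

result_a = no_return(df.copy())
result_b = with_return(df.copy())

print(result_a)        # None, so result_a.to_excel(...) would raise AttributeError
print(type(result_b))  # a pandas DataFrame
```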
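Another issue is that you should probably be using df.loc[row, col] = new_val to update your values. df.update() accepts DataFrames (or, per the docs, objects that can be coerced to DataFrames), whereas you are updating one value at a time.
Another issue is that your conditions can be simplified. Rather than writing hour == x1 or hour == x2 or ..., you can put the possible values in a list and check membership, like hour in [x1, x2, ...].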
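As a rough illustration of both points (the sample values are invented, and hour is assumed to already be a zero-padded string):

```python
import pandas as pd

# Hypothetical three-row frame; only the resource column matters here
df = pd.DataFrame({'resource': ['BDIAG1', 'BDIAG2', 'BDIAG3']})

# Update a single cell by row label and column name
df.loc[1, 'resource'] = 'ZBMDX3'

# Membership test instead of a long chain of `or` comparisons
hour = '09'
long_form = hour == '08' or hour == '09' or hour == '10'
short_form = hour in ['08', '09', '10']

print(df)
print(long_form, short_form)
```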
Since there is a lot to unpack here, I have written a skeleton of what I mean:
Solution 1
def diagnostic(df):  # Check Diagnostic Breast scheduled appointments
    for i in range(10):
        minutes = str(df.loc[i, 'start appointment'])[10:12]
        hour = str(df.loc[i, 'start appointment'])[8:10]
        # condition_1 ... condition_4 are placeholders for your own logic
        if condition_1:
            df.loc[i, 'resource'] = 'ZBMDX3'
        elif condition_2:
            df.loc[i, 'resource'] = 'ZBMDX2'
        elif condition_3:
            df.loc[i, 'resource'] = 'ZBMDX1'
        elif condition_4:
            df.loc[i, 'resource'] = 'ZBMDX4'
    return df
df2 = diagnostic(df)
Each condition is your own logic (something like condition_1 = (minutes == '15') and hour in ['09', '10', '11']), and so on.
Solution 2
Another way to do this is to create a function that changes each row according to some logic and then apply it to your DataFrame. Something like the following:
def update_val(row):
    minutes = str(row['start appointment'])[10:12]
    hour = str(row['start appointment'])[8:10]
    resource = row['resource']
    cond1, cond2, cond3, cond4 = True, False, False, False
    if cond1:
        row['resource'] = 'ZBMDX3'
    elif cond2:
        row['resource'] = 'ZBMDX2'
    elif cond3:
        row['resource'] = 'ZBMDX1'
    elif cond4:
        row['resource'] = 'ZBMDX4'
    return row
df2 = df.apply(update_val, axis='columns')
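Obviously, you would replace the dummy conditions cond1 etc. with your own conditional logic.
I prefer Solution 2 because it is cleaner and makes the changes easier to track. It is usually more performant as well (although I have not verified that in this case).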
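If performance matters, a vectorized version is a third option that neither solution above covers. The sketch below swaps the row-by-row function for boolean-mask assignment with DataFrame.loc. It assumes the start appointment column can be parsed as datetimes and that resource may carry stray whitespace; the sample rows and the hour lists come from the question, while the rules list is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'start appointment': pd.to_datetime(
        ['20171020131500', '20171020133000', '20171020144500', '20171023080000'],
        format='%Y%m%d%H%M%S'),
    'resource': ['BDIAG2', 'BDIAG1', 'BDIAG2', 'BDIAG1'],
})

# Compute the pieces once, before anything is overwritten
minute = df['start appointment'].dt.minute
hour = df['start appointment'].dt.hour
is_diag = df['resource'].str.strip().isin(['BDIAG1', 'BDIAG2', 'BDIAG3'])

# (minute, allowed hours, new resource), taken from the question's criteria
rules = [
    (15, [8, 9, 10, 11, 13, 14, 15], 'ZBMDX3'),
    (0, [8, 9, 10, 11, 13, 14, 15], 'ZBMDX2'),
    (45, [7, 8, 9, 10, 12, 13, 14], 'ZBMDX1'),
    (30, [8, 9, 10, 13, 14], 'ZBMDX4'),
]

for rule_minute, rule_hours, new_resource in rules:
    mask = is_diag & (minute == rule_minute) & hour.isin(rule_hours)
    df.loc[mask, 'resource'] = new_resource

print(df)
```

Because each rule matches a different minute value, the masks cannot overlap, so the order of the rules does not matter here.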
I do not have enough reputation to comment on your question, so I will post a modified version of your code that should work:
import pandas as pd
# Raw string so the backslashes in the Windows path are not read as escapes
my_file = r'C:\Users\user_name\Desktop\Python 3\schdocexprt10_Bob - Copy.xlsx'
df = pd.read_excel(my_file, sheet_name='Sheet3')

def update_val(row):
    def time_range(start, stop):
        # Zero-padded hour strings from start to stop inclusive, e.g. '08', '09'
        return [str(el).zfill(2) for el in range(start, stop + 1)]
    minutes = str(row['start appointment'])[14:16]  # [10:12] in sample data
    hour = str(row['start appointment'])[11:13]  # [8:10] in sample data
    resource = row['resource']
    # Note: the ranges are contiguous, so they also match hours (such as 12
    # in time_range(8, 15)) that the question's criteria leave out
    # Condition 1
    if (minutes == '00') and (hour in time_range(8, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX2'
    # Condition 2
    elif (minutes == '15') and (hour in time_range(9, 15)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX3'
    # Condition 3
    elif (minutes == '45') and (hour in time_range(7, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX1'
    # Condition 4
    elif (minutes == '30') and (hour in time_range(8, 14)) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        row['resource'] = 'ZBMDX4'
    return row

df2 = df.apply(update_val, axis='columns')
print(df2.head())
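I made two changes.
1) Put the sub-conditions in parentheses. I believe they were not formed correctly in your original, so they never evaluated to True.
2) Changed the slice indices for start appointment. With your sample data, the original indices return an empty str, so none of the branches is ever taken.
P.S. You can just print the first five rows to the console to check whether the values were updated, instead of writing to disk every time.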
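Both changes can be checked in isolation. The snippet below is only a sketch with hand-picked values. It shows that `x in some_list == True` is a chained comparison in Python, meaning `(x in some_list) and (some_list == True)`, and that slicing a 14-character string at [14:16] yields an empty string.

```python
resource = 'BDIAG1'
allowed = ['BDIAG1', 'BDIAG2', 'BDIAG3']

# Chained comparison: (resource in allowed) and (allowed == True)
unparenthesized = resource in allowed == True
# Parenthesized: the membership test is evaluated on its own
parenthesized = (resource in allowed) == True

raw = '20171020131500'    # 14-digit value from the sample data
empty_slice = raw[14:16]  # starts past the end of the string
minutes = raw[10:12]
hour = raw[8:10]

print(unparenthesized, parenthesized)
print(repr(empty_slice), minutes, hour)
```

Dropping the `== True` altogether, as the code above does, avoids the problem entirely.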
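OK, I have now looked at the sample data and found the problem. The resource column has trailing whitespace, which causes the logic to fail. It can be removed simply with str.strip(). Also, the start appointment field is parsed as a pandas.tslib.Timestamp object, which lets us extract the minute and hour fields as int and so simplifies our logic. The following should work: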
def update_val(row):
    minutes = row['start appointment'].minute
    hour = row['start appointment'].hour
    resource = row['resource'].strip()
    # Condition 1
    if (minutes == 0) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1', 'BDIAG2', 'BDIAG3']):
        new_resource = 'ZBMDX2'
    # Condition 2
    elif (minutes == 15) and (hour in [8,9,10,11,13,14,15]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX3'
    # Condition 3
    elif (minutes == 45) and (hour in [7,8,9,10,12,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX1'
    # Condition 4
    elif (minutes == 30) and (hour in [8,9,10,13,14]) and (resource in ['BDIAG1','BDIAG2','BDIAG3']):
        new_resource = 'ZBMDX4'
    else:
        new_resource = resource
    row['resource'] = new_resource
    return row
df2 = df.apply(update_val, axis='columns')
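A quick way to confirm the two findings on your own data, sketched with a made-up two-row frame (the trailing space in 'BDIAG2 ' is deliberate, and the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'start appointment': pd.to_datetime(['2017-10-20 13:15:00',
                                         '2017-10-20 14:30:00']),
    'resource': ['BDIAG2 ', 'BDIAG1'],
})

allowed = ['BDIAG1', 'BDIAG2', 'BDIAG3']
first = df.loc[0]

# repr() makes otherwise invisible trailing whitespace visible
print(repr(first['resource']))
before_strip = first['resource'] in allowed
after_strip = first['resource'].strip() in allowed

# Timestamp values expose hour and minute as plain integers
ts = first['start appointment']
print(type(ts).__name__, ts.hour, ts.minute)
print(before_strip, after_strip)
```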