
Read Excel incorrectly parsing European dates (Python 3.4.3 || Pandas 0.17.0)

This follows on from the question below, which doesn't seem to have an answer yet.

Read dates from excel to Pandas Dataframe

On European machines, Pandas has a confusing bug when parsing dates from an Excel sheet in European format (dd-mm-yyyy). Dates with a day number from 1 to 12 are automatically converted to the American standard (mm-dd-yyyy), while dates with a day number greater than 12 are parsed the European way (dd-mm-yyyy). This obviously leads to problems.

  • 10-05-2011 => 05-10-2011
  • 05-10-2011 => 10-05-2011
  • 31-05-2011 => 31-05-2011
  • 14-12-2011 => 14-12-2011
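To make the ambiguity concrete, here is a minimal sketch using only the standard library (not pandas itself). The variable names are made up for illustration. It shows that a string with a day of 12 or lower is valid under both readings, while a day above 12 only fits the day-first format.

```python
from datetime import datetime

# A day number <= 12 can be read both ways, giving two different dates.
ambiguous = "10-05-2011"
day_first = datetime.strptime(ambiguous, "%d-%m-%Y")    # 10 May 2011
month_first = datetime.strptime(ambiguous, "%m-%d-%Y")  # 5 October 2011

# A day number > 12 only fits the day-first format.
unambiguous = "31-05-2011"
only_reading = datetime.strptime(unambiguous, "%d-%m-%Y")  # 31 May 2011
try:
    datetime.strptime(unambiguous, "%m-%d-%Y")
    month_first_valid = True
except ValueError:
    # There is no month 31, so the month-first reading is rejected.
    month_first_valid = False
```

A parser that guesses per value can therefore only get the unambiguous dates right by accident of the data.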

You can always post-process the dates and swap day and month when both are < 13, but that doesn't seem to be how it is supposed to work. Has anyone found a better solution? Thanks in advance!
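For reference, the post-processing fallback mentioned above could look roughly like this. It is a sketch, not a recommendation: swap_day_month is a hypothetical helper, and it assumes every date with a day of 12 or lower was misread as month-first, which may not hold for your data.

```python
from datetime import date

def swap_day_month(d):
    """Undo a month-first misreading by swapping day and month.

    Only dates whose day is 12 or lower can have been misread,
    so dates with a larger day are returned unchanged.
    """
    if d.day <= 12:
        return d.replace(day=d.month, month=d.day)
    return d

# 10-05-2011 misread as 5 October becomes 10 May again.
fixed = swap_day_month(date(2011, 10, 5))
# 31-05-2011 was read correctly and stays as it is.
untouched = swap_day_month(date(2011, 5, 31))
```

The weakness is that the swap cannot tell a misread date from a correctly read one, which is why parsing with an explicit day-first rule is the safer route.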

python: '3.4.3 |Anaconda 2.1.0 (x86_64)| (default, Oct 20 2015, 14:27:51) \n[GCC 4.2.1 (Apple Inc. build 5577)]'

Pandas: '0.17.0'

EDIT 17 Nov 2015

Found a workaround/solution myself: pass dayfirst=True to to_datetime().
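A minimal sketch of that workaround, run on the four example dates from the list above held as text. The Series here is constructed by hand for illustration; in the real script the values come from the Excel sheet.

```python
import pandas as pd

# The four example dates, held as day-first text.
dates = pd.Series(["10-05-2011", "05-10-2011", "31-05-2011", "14-12-2011"])

# dayfirst=True tells pandas to read ambiguous values as dd-mm-yyyy.
parsed = pd.to_datetime(dates, dayfirst=True)

# ISO strings make the parsed day and month unambiguous to inspect.
iso = parsed.dt.strftime("%Y-%m-%d").tolist()
```

With dayfirst=True all four values keep their European meaning, including the two with a day number below 13.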

It still seems like a bug to me. I added a simplified version of my code to give some more context. The script reads an Excel sheet with personal data and converts it into a new sheet that can be used for server upload. The input can vary quite a lot, but I simplified the example.

I added my solution to the code and let it produce two date outputs: one with and one without dayfirst=True.

I ran the code on two different Excel sheets. One had no problem at all (the xlsx file, example 2), and the other (xls, example 1) showed a difference between the two columns. It seems like pandas correctly recognizes day and month, but has difficulty creating a string from a date and mixes up the order automatically in the IPython output.
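One possible reading of that difference, offered as an assumption rather than something the post confirms: if a sheet stores real date cells, the values arrive as datetimes and dayfirst has nothing left to decide, whereas text cells are re-parsed by to_datetime() and the default is month-first. The sketch below only illustrates that distinction with hand-made values.

```python
import pandas as pd

# Values that are already datetimes: dayfirst makes no difference.
real_dates = pd.Series(pd.to_datetime(["2011-05-10", "2011-05-31"]))
same = pd.to_datetime(real_dates, dayfirst=True).equals(pd.to_datetime(real_dates))

# A value that is still text: the default reading is month-first.
text = "10-05-2011"
default_reading = pd.to_datetime(text)                  # 5 October 2011
day_first_reading = pd.to_datetime(text, dayfirst=True)  # 10 May 2011
```

Checking df1.Gebdatum.dtype for both files before calling to_datetime() would show whether the two sheets actually differ in this way.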

Input list for example 1

Final list for xls file, see the problem with Name 4

Input list for example 2

Final list for xlsx file, no problem with name 3

# Module for test list

import os

import numpy as np
import pandas as pd

path = "xxxx"
namefile = "testlist 1.xls"
#namefile = "testlist 2.xlsx"
schoolnaam = 'schoolname'
BRIN = 'XXXX'
meetperiode = 'MPX'
meetjaar = '20xx/20xx'

os.chdir(path)

# Read the first sheet without a header row
df = pd.read_excel(namefile, 0, header=None, parse_dates=True)

df1 = df

df1.columns = ['Leerlingnummer', 'Achternaam', 'Geslacht', 'Blank', 'Leerjaar', 'Gebdatum']
df1 = df1[['Leerlingnummer', 'Achternaam', 'Geslacht', 'Gebdatum']]

# Sheet Leerling

df1.loc[df1['Leerlingnummer'].str.contains('Groep|/|A|B|C|D|E|F|G|H|I|J', na=False), 'Naam groep'] = df1.Leerlingnummer
df1['Naam groep'] = df1['Naam groep'].ffill()

df1.dropna(thresh=5, inplace = True)


df1['Achternaam'] = df1['Achternaam'].str.strip()
df1['Geslacht'] = df1['Geslacht'].str.strip().str.upper()
df1['Naam groep'] = df1['Naam groep'].str.strip()
df1['Voornaam'] = np.nan
df1['Tussenvoegsel'] = np.nan
df1['Geboortedatum']= pd.to_datetime(df1.Gebdatum).apply(lambda x: x.strftime('%d-%m-%Y'))
df1['Geboortedatum2']= pd.to_datetime(df1.Gebdatum, dayfirst=True).apply(lambda x: x.strftime('%d-%m-%Y'))

dfLeerling = df1[['Achternaam','Voornaam','Tussenvoegsel','Geslacht','Geboortedatum','Geboortedatum2','Naam groep']]


# Sheet Groep

gb = df1.groupby('Naam groep')
klaslijst = list(gb.groups)
klaslijst.sort()

dfGroep = pd.DataFrame(data = klaslijst, columns=['Naam groep'])
dfGroep['Lesjaar'] = meetjaar
dfGroep['Naam leraar'] = np.nan
dfGroep['Opmerkingen'] = np.nan

# Sheet School

dfSchool = pd.DataFrame({'BRIN': BRIN, 'Naam school': schoolnaam, 'Adres':[np.nan], 'Postcode':[np.nan], 'Plaats':[np.nan],
                       'Telefoon':[np.nan], 'Email':[np.nan], 'Website':[np.nan]})
dfSchool = dfSchool[['BRIN','Naam school','Adres','Postcode','Plaats','Telefoon','Email','Website']]

# Writer

namefile2 = 'Final list %s %s.xlsx' % (schoolnaam, meetperiode)

writer = pd.ExcelWriter(namefile2)
dfSchool.to_excel(writer, 'School', index=False)
dfGroep.to_excel(writer, 'Groep', index=False)
dfLeerling.to_excel(writer, 'Leerling', index=False)
writer.save()

dfLeerling.head()

When this happens, I create the DataFrame with the date columns already forced to str, so nothing gets interpreted as a date:

dtype={'x':'str','y':'str'}

After that, you can use the to_datetime() method and specify the format you want:

format='%d/%m/%Y'
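Put together, the two steps might look like the sketch below. Two assumptions: your pandas version's read_excel() accepts dtype (it was added after 0.17.0), and the cells hold text in dd/mm/yyyy form. To keep the example self-contained, a hand-built DataFrame stands in for the read_excel() result; the column name Gebdatum is borrowed from the question.

```python
import pandas as pd

# Stand-in for something like:
#   raw = pd.read_excel("testlist 1.xls", header=None, dtype={"Gebdatum": "str"})
# (illustrative call; the file and column mapping depend on your sheet)
raw = pd.DataFrame({"Gebdatum": ["10/05/2011", "05/10/2011", "31/05/2011"]})

# An explicit format leaves no room for guessing day versus month.
parsed = pd.to_datetime(raw["Gebdatum"], format="%d/%m/%Y")

days = parsed.dt.day.tolist()
months = parsed.dt.month.tolist()
```

Unlike dayfirst=True, an explicit format raises an error for values that do not match it, which makes unexpected input visible instead of silently reinterpreting it.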
