
pandas dataframe: save & read excel sheet | handling integers as strings

I have a pandas dataframe (df).

df has plenty of columns and rows, many of which are integers.

My intention is to save the dataframe as an Excel file and read it back again while retaining the integrity of the data.

I'm using the following steps.

To save to Excel:

import pandas as pd

writer = pd.ExcelWriter("myExcelFile.xlsx")
df.to_excel(writer, 'sheet_name')
writer.save()

To read from Excel:

import glob

files = glob.glob("myExcelFile*.xlsx")  # gives a list of matching files
myFile = files[0]
df = pd.read_excel(myFile, sheetname='sheet_name', convert_float=True)

Please note the option "convert_float". Supposedly, Excel saves all numbers in float format, so this option is supposed to convert float values to integers where possible.

For instance, 1.0 -> 1.
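
The conversion described here can be pictured with a small sketch. The helper name `integral_float_to_int` is made up for illustration and is not part of pandas; it only mirrors the 1.0 -> 1 behaviour the option is described as having.

```python
def integral_float_to_int(value):
    """Return an int when a float has no fractional part; otherwise return the value unchanged."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


# Hypothetical cell values as they might come out of a spreadsheet.
cells = [1.0, 2.5, 3.0, "text"]
converted = [integral_float_to_int(cell) for cell in cells]
print(converted)  # [1, 2.5, 3, 'text']
```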

My requirement is to get back the original integer values that I saved in this Excel sheet when I read it later. However, this doesn't work for some reason. Am I going wrong somewhere?

Is there a way I can handle that while saving to Excel?

I've tried to tackle this by converting integers to strings, storing the strings in Excel, reading the strings back from Excel, and reconverting them to integers. But the pain is too severe, both for me and my app :-/
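
For reference, the string round trip described above looks roughly like this when done directly on a dataframe. This is only a sketch with made-up column names, and it skips the Excel step entirely; it shows that every integer column has to be converted twice.

```python
import pandas as pd

# Hypothetical integer data standing in for the real dataframe.
df = pd.DataFrame({"ids": [1, 20, 300], "counts": [7, 8, 9]})

# Step 1: turn every integer into text before saving.
as_text = df.astype(str)

# Step 2: after reading the text back, convert each column to integers again.
restored = as_text.astype("int64")

print(as_text.loc[0, "ids"], type(as_text.loc[0, "ids"]).__name__)
print(restored.equals(df))
```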

I can't replicate your problem. It seems to work fine for me:

import pandas as pd

df = pd.DataFrame({'Floats': [10.1, 20.2, 30.3, 20.0, 15.9, 30.1, 45.0],
                   'Integers': [10.0, 20.0, 30, 20, 15, 30, 45]})

filename = 'df.xlsx'

writer = pd.ExcelWriter(filename)
df.to_excel(writer)
writer.save()

df = pd.read_excel(filename, convert_float=True)
print(df)

Result:

   Floats  Integers
0    10.1        10
1    20.2        20
2    30.3        30
3    20.0        20
4    15.9        15
5    30.1        30
6    45.0        45

Do you get the same result when you run this code? If so, then there must be something else going on. Can you give us code that demonstrates the problem?
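
For readers on a newer pandas, a comparable round trip can be sketched as follows. This assumes a release where `ExcelWriter` is used as a context manager and the `openpyxl` engine is installed; the in-memory buffer and `index_col=0` are choices made for this sketch, not part of the code above. `convert_float` is left out because it was deprecated and later removed from `read_excel`.

```python
import io

import pandas as pd

df = pd.DataFrame({"Floats": [10.1, 20.2, 20.0],
                   "Integers": [10, 20, 30]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="sheet_name")

# Read it back, using the first column as the index.
buffer.seek(0)
restored = pd.read_excel(buffer, sheet_name="sheet_name",
                         index_col=0, engine="openpyxl")

print(restored)
print(restored.dtypes)
```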

Note that a column containing at least one float will be treated as floats throughout, because you can't usually have multiple datatypes in a given column (see below regarding the object column type).
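
The dtype rule can be checked directly on small series, without going through Excel. The values below are made up for illustration.

```python
import pandas as pd

all_ints = pd.Series([10, 20, 30])
one_float = pd.Series([10, 20, 30.5])        # a single float makes the whole column float
with_string = pd.Series([10, 20.0, "text"])  # mixed types fall back to object

print(all_ints.dtype)     # int64
print(one_float.dtype)    # float64
print(with_string.dtype)  # object
```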

If convert_float=True doesn't work for some reason, one workaround is to force certain columns and/or the index to be integers manually, like this:

df = pd.read_excel(filename, convert_float=False)  # disable automatic conversion (the default is True)
df['Integers'] = df['Integers'].astype(int)
df.index = df.index.astype(int)
print(df)

And you could force all columns to be integers like this:

df = pd.read_excel(filename).astype(int)
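
Be aware that `.astype(int)` on every column truncates genuine floats (10.1 becomes 10) and raises an error if a column contains missing values. A more conservative sketch converts only the float columns whose values are all whole numbers; the helper name `integral_columns_to_int` is hypothetical, not a pandas function.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def integral_columns_to_int(df):
    """Convert float columns to int64 when every value is a whole number."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not is_float_dtype(series):
            continue
        # Skip columns with missing values or fractional parts (assumes finite values).
        if series.notna().all() and (series == series.round()).all():
            out[col] = series.astype("int64")
    return out


df = pd.DataFrame({"Floats": [10.1, 20.2, 20.0],
                   "Integers": [10.0, 20.0, 30.0]})
result = integral_columns_to_int(df)
print(result.dtypes)
```

Columns such as 'Floats' are left alone, so no fractional data is lost.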

Edit after OP gave more detail:

If you know which columns need to be treated as strings, you can use the same manual technique from above:

df['Strings'] = df['Strings'].astype(str)

But you want it to be more automatic. This is hacky, but it works. If you add a dummy value to the end of your data that is blatantly a string, like 'dummy', then pandas will bring the column in as objects, with each element having its own datatype. Without the dummy string, it doesn't work. You can try the commented-out dataframe in my code to see.

import pandas as pd

# This works.
df = pd.DataFrame({'Floats': [10.1, 20.2, 30.3, 20.0, 15.9, 30.1, 0],
                   'Objects': ['10.0', 20.0, 30.5, 20, 15, 30, 'dummy']})
# This doesn't work.
# df = pd.DataFrame({'Floats': [10.1, 20.2, 30.3, 20.0, 15.9, 30.1],
#                  'Objects': ['10.0', 20.0, 30.5, 20, 15, 30]})

filename = 'df.xlsx'

writer = pd.ExcelWriter(filename)
df.to_excel(writer)
writer.save()

# Remove the dummy row.
df = pd.read_excel(filename)[:-1]

print(df)
print('')
print(df.dtypes)
print('')
# Show each of the first four values alongside its Python type.
for i in range(4):
    value = df.loc[i, 'Objects']
    print('%s %s' % (value, type(value)))

Result:

   Floats Objects
0    10.1    10.0
1    20.2      20
2    30.3    30.5
3    20.0      20
4    15.9      15
5    30.1      30

Floats     float64
Objects     object
dtype: object

10.0 <type 'unicode'>
20 <type 'int'>
30.5 <type 'float'>
20 <type 'int'>
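
The per-element typing of an object column can also be seen without a spreadsheet. This sketch builds such a column by hand (the values are made up) and is not the Excel round trip itself.

```python
import pandas as pd

# An object column keeps each element as its original Python object.
objects = pd.Series(["10.0", 20, 30.5], dtype=object)

print(objects.dtype)  # object
element_types = [type(value).__name__ for value in objects]
print(element_types)  # ['str', 'int', 'float']
```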
