After converting a CSV file to Excel, integers are stored as strings - how to convert them back?
In this project I've converted a csv file to an xls file and a txt file to an xls file. The objective is to then compare both xls files and print any differences to a third excel file.
However, when the differences are printed they include every entry with an integer above 999, because any integer from my converted csv file is treated as a string instead of an integer. As a result, a value such as 1,200 (in the xls file converted from csv) is treated as different from 1200 (in the xls file converted from txt) because of the comma.
My question is: is there a way to convert the string-interpreted integers back to being interpreted as integers? Otherwise, is there a way to delete all commas from my xls files? I have tried the usual dataframe.replace approach and it is ineffective.
Below is my code:
#import required libraries
import datetime
import xlrd
import pandas as pd
#build the time_handle timestamp string used to name the outputted excel files
time_handle = datetime.datetime.now().strftime("%Y%m%d_%H%M")
#identify XM1 file paths (for both csv origin and excel destination)
XM1_csv = r"filepath"
XM1_excel = r"filepath" + time_handle + ".xlsx"
#identify XM2 file paths (for both txt origin and excel destination)
XM2_txt = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"
#remove commas from XM1 excel - failed attempts
#XM1_excel = [col.replace(',', '') for col in XM1_excel]
#XM1_excel = XM1_excel.replace(",", "")
#for line in XM1_excel:
#XM1_excel.write(line.replace(",", ""))
#remove commas from XM1 CSV - failed attempts
#XM1_csv = [col.replace(',', '') for col in XM1_csv]
#XM1_csv = XM1_csv.replace(",", "")
#for line in XM1_csv:
#XM1_excel.write(line.replace(",", ""))
#convert the csv XM1 file to an excel file, in the same folder
pd.read_csv(XM1_csv).to_excel(XM1_excel)
#convert the txt XM2 file to an excel file in the same folder
pd.read_csv(XM2_txt, sep="|").to_excel(XM2_excel)
#confirm XM1 filepath
filepath_XM1 = XM1_excel
#confirm XM2 filepath
filepath_XM2 = XM2_excel
#read relevant columns from the excel files
#(sheet_name and usecols replace the older sheetname and parse_cols keywords in current pandas)
df1 = pd.read_excel(filepath_XM2, sheet_name="Sheet1", usecols="H, J, M, U")
df2 = pd.read_excel(filepath_XM1, sheet_name="Sheet1", usecols="C, E, G, K")
#remove all commas from XM1 - failed attempts
#df2 = [col.replace(',', '') for col in df2]
#df2 = df2.replace(",", "")
#for line in df2:
#df2.write(line.replace(",", ""))
#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)
#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])
#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df]
#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df})
#remove any duplicates
df.drop_duplicates(subset=None, keep=False, inplace=True)
#print to the console just as a visual aid
print(df)
#output_path = r"filepath"
#print the erroneous entries to an excel file
df.to_excel("XM1_XM2Comparison" + time_handle + ".xls")
Also, I realize the XM1 and XM2 file names are a bit confusing with regard to df1 and df2, but I simply renamed my files. It makes sense in terms of the files and where they belong in the code!
Thank you
You can try an argument called converters on the read end of the dataframe, where you can specify the datatype. Example:
df = pd.read_excel(file, sheet_name=YOUR_SHEET_HERE, converters={'FIELD_NAME': str})
converters is available in both read_csv and read_excel.
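As a rough illustration of how converters could address the comma problem directly, here is a minimal sketch. The sample data, the column names, and the to_int helper are assumptions made up for the example; they are not taken from the asker's files.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = 'Exchange,Quantity\nCME,"1,200"\nICE,950\n'

def to_int(value):
    """Hypothetical helper: drop thousands separators, then convert to int."""
    return int(value.replace(",", ""))

# read_csv hands each raw cell of the Quantity column to to_int as text.
df = pd.read_csv(io.StringIO(csv_text), converters={"Quantity": to_int})
print(df["Quantity"].tolist())
```

A converter is applied per column, so only the columns listed in the dict are affected; the others are parsed as usual.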
I actually solved this issue with a simple fix; posting it here for future reference. When reading the csv with pd.read_csv, I added the thousands parameter, so it looks like this:
pd.read_csv(XM1_csv, thousands=",").to_excel(XM1_excel)
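To see what the thousands parameter changes, here is a small self-contained sketch. The in-memory sample data and column names are assumptions used in place of the real csv file.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file.
csv_text = 'Exchange,Quantity\nCME,"1,200"\nICE,950\n'

# Without thousands, the comma keeps the whole column as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# With thousands=",", the separator is dropped while parsing numbers.
as_numbers = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(as_text["Quantity"].tolist())
print(as_numbers["Quantity"].tolist())
```

Because the fix happens at read time, the Excel file written afterwards already holds numeric cells, so the later comparison sees 1200 on both sides.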