
How to remove decimal point from string using pandas

I'm reading an .xls file and converting it to a CSV file in Databricks using PySpark. My input data in the .xls file is the string 101101114501700, but after converting it to CSV format using pandas and writing it to the datalake folder, the value shows up as 101101114501700.0. My code is given below. Why am I getting a decimal part in the data?

import os
import time
import pandas as pd

for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df1 = df.to_csv("/path/to/file" + file.split('.')[0] + ".csv", sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")

Your question has nothing to do with Spark or PySpark. It's related to Pandas.

This is because Pandas interprets and infers each column's data type automatically. Since all the values in your column are numeric, Pandas will consider it a float data type.
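A minimal, self-contained sketch of that behaviour (the column name "id" is made up here; a missing cell is one common way a numeric column ends up as float):

import pandas as pd
import numpy as np

# Hypothetical single-column frame: once pandas holds the values as
# float64 (a missing cell is enough to force that), to_csv writes the
# number with a trailing ".0".
df = pd.DataFrame({"id": [101101114501700, np.nan]})
print(df["id"].dtype)          # float64
print(df.to_csv(index=False))  # ... 101101114501700.0

# Keeping the column as plain strings preserves the text exactly.
df_str = pd.DataFrame({"id": ["101101114501700"]})
print(df_str["id"].dtype)          # object
print(df_str.to_csv(index=False))  # ... 101101114501700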

To avoid this, the pandas.ExcelFile.parse method accepts an argument called converters; you can use it to tell Pandas the data type of specific columns:

# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])

OR

# if you want all columns as string
# and you have multi sheets and they do not have same columns
# this merge all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat(
    [filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column']))
     for name in names]
).reset_index(drop=True)

OR

# if you want all columns as string
# and all your sheets have same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)

I think the field is automatically parsed as float when reading the Excel file. I would correct it afterwards:

df['column_name'] = df['column_name'].astype(int)

If your column contains nulls, you can't convert it to integer, so you will need to fill the nulls first:

df['column_name'] = df['column_name'].fillna(0).astype(int)
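If your pandas version is 0.24 or newer, another option is the nullable 'Int64' extension dtype, which keeps missing values as <NA> instead of filling them with 0; a quick sketch:

import pandas as pd
import numpy as np

# Sketch: the nullable 'Int64' dtype (capital I) holds integers alongside
# missing values, so there is no trailing ".0" and no need to fill with 0.
s = pd.Series([101101114501700, np.nan])
print(s.astype('Int64'))
# 0    101101114501700
# 1               <NA>
# dtype: Int64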

Then you can concatenate and store it the way you were doing.
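Putting it together, a sketch of the original loop with this fix applied before writing (the paths and 'column_name' are placeholders, as in the question):

import os
import pandas as pd

# Sketch of the original loop with the integer fix applied before writing.
for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        excel = pd.ExcelFile(filepath)
        df = pd.concat([excel.parse(name) for name in excel.sheet_names])
        # Fill nulls and cast to int, then write the CSV as before.
        df['column_name'] = df['column_name'].fillna(0).astype(int)
        df.to_csv("/path/to/file/" + file.split('.')[0] + ".csv",
                  sep=',', encoding='utf-8', index=False)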
