How to remove the decimal point from a string using pandas

Question

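I am reading xls files and converting them to csv files in Databricks using PySpark. My input data is in string format in the xls file: 101101114501700. But after converting it to CSV with pandas and writing it to the data lake folder, my data shows up as 101101114501700.0. My code is below. Please help me understand why I am getting the decimal part in my data.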

import os
import time

import pandas as pd

for file in os.listdir("/path/to/file"):
    if file.endswith(".xls"):
        filepath = os.path.join("/path/to/file", file)
        filepath_pd = pd.ExcelFile(filepath)
        names = filepath_pd.sheet_names
        df = pd.concat([filepath_pd.parse(name) for name in names])
        df1 = df.to_csv("/path/to/file"+file.split('.')[0]+".csv", sep=',', encoding='utf-8', index=False)
        print(time.strftime("%Y%m%d-%H%M%S") + ": XLS files converted to CSV and moved to folder")

Answer 1

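Your problem has nothing to do with Spark or PySpark. It is related to pandas.

This happens because pandas automatically interprets and infers the data type of each column. Since all the values in the column are numeric, pandas treats it as a float data type.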
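The effect can be reproduced without an Excel file. The following is a minimal sketch, assuming a column of whole numbers with at least one empty cell, which is one situation where pandas falls back to `float64`. The column name `account_id` is hypothetical.

```python
import pandas as pd

# Hypothetical data: a whole-number column with one missing value.
# The missing value forces pandas to store the column as float64.
df = pd.DataFrame({"account_id": [101101114501700, None]})
inferred_dtype = str(df["account_id"].dtype)
print(inferred_dtype)

# Writing the float column to CSV adds the trailing ".0" seen in the question.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Once the column is float, `to_csv` writes every value with a decimal part, so the fix has to happen before or at the point where the column type is decided.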

To avoid this, the pandas.ExcelFile.parse method accepts a parameter named converters, which you can use to tell pandas the data type of specific columns, as follows:

# if you want one specific column as string
df = pd.concat([filepath_pd.parse(name, converters={'column_name': str}) for name in names])

Or

# if you want all columns as string (except the date columns in dt_cols)
# and you have multiple sheets that do not share the same columns;
# this merges all sheets into one dataframe
def get_converters(excel_file, sheet_name, dt_cols):
    cols = excel_file.parse(sheet_name).columns
    converters = {col: str for col in cols if col not in dt_cols}
    for col in dt_cols:
        converters[col] = pd.to_datetime
    return converters

df = pd.concat([filepath_pd.parse(name, converters=get_converters(filepath_pd, name, ['date_column'])) for name in names]).reset_index(drop=True)

Or

# if you want all columns as string
# and all your sheets have same columns
cols = filepath_pd.parse().columns
dt_cols = ['date_column']
converters = {col: str for col in cols if col not in dt_cols}
for col in dt_cols:
    converters[col] = pd.to_datetime
df = pd.concat([filepath_pd.parse(name, converters=converters) for name in names]).reset_index(drop=True)

Answer 2

I think the field is automatically parsed as float when the Excel file is read. I would correct it afterwards:

df['column_name'] = df['column_name'].astype(int)

If your column contains nulls, it cannot be converted to integer, so you need to fill the null values first:

df['column_name'] = df['column_name'].fillna(0).astype(int)

Then you can concatenate and store the data the way you were already doing.
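As a rough sketch of this approach, the snippet below builds a small float column by hand (hypothetical data, not read from Excel), applies the fill-then-cast step, and checks the CSV output. It uses `"int64"` explicitly because plain `int` can map to a 32-bit type on some platforms and NumPy versions, which would be too small for a value like 101101114501700. The nullable `"Int64"` dtype shown at the end is an alternative not mentioned in the answer. It keeps missing values as `<NA>` instead of replacing them with 0.

```python
import pandas as pd

# Hypothetical float column, as it might look after reading the Excel file.
df = pd.DataFrame({"column_name": [101101114501700.0, None]})

# Approach from the answer: fill missing values with 0, then cast to integer.
filled = df["column_name"].fillna(0).astype("int64")
csv_text = filled.to_frame().to_csv(index=False)
print(csv_text)

# Alternative (not from the answer): the nullable integer dtype keeps
# missing values as <NA> rather than turning them into 0.
nullable = df["column_name"].astype("Int64")
print(nullable)
```

Filling with 0 changes the data, so the nullable dtype may be preferable when an empty cell has to stay distinguishable from a real zero.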
