
How to clean a CSV file for a coordinate system using pandas?

I wanted to write a program that converts CSV files to DXF (AutoCAD). The CSV file sometimes comes with a header and sometimes without one, and some cells, such as the coordinates, must not be empty. I also noticed that after excluding some of the inputs, the remaining values show up as nan or NaN, and these had to be removed. Below is my answer; please share your opinions on how to implement a better method.

Sample input

[image: sample input]

Output

[image: output]

Solution

import string
import pandas


def pandas_clean_csv(csv_file):
    """
    Clean a CSV file of coordinates and return its rows as tuples.

    This article on filtering missing (NaN or NULL) values in a pandas
    DataFrame helped me and may help you as well:
    https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
    """
    try:
        if not csv_file.endswith('.csv'):
            raise TypeError("Be sure you select .csv file")
        
        # treat cells that contain only a punctuation mark as missing values:
        # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        punctuations_list = list(string.punctuation)
    
        # import csv file and read it by pandas
        data_frame = pandas.read_csv(
            filepath_or_buffer=csv_file,
            header=None,
            skip_blank_lines=True,
            on_bad_lines='error',  # replaces error_bad_lines, which was removed in pandas 2.0
            encoding='utf8',
            na_values=punctuations_list
        )
        
        # if elevation column is NaN convert it to 0
        data_frame[3] = data_frame[3].fillna(0)
        
        # if Description column is NaN convert it to -
        data_frame[4] = data_frame[4].fillna('-')
        
        # select coordinates columns
        coord_columns = data_frame.iloc[:, [1, 2]]
        
        # convert coordinates columns to numeric type; non-numeric cells
        # (including header text) become NaN
        coord_columns = coord_columns.apply(pandas.to_numeric, errors='coerce')
        
        # Find rows with missing data
        index_with_nan = coord_columns.index[coord_columns.isnull().any(axis=1)]
        
        # Remove rows with missing data
        data_frame.drop(index=index_with_nan, inplace=True)
        
        # iterate data frame as tuple data
        output_clean_csv = data_frame.itertuples(index=False)
        
        return output_clean_csv
    
    except Exception as E:
        print(f"Error: {E}")
        raise SystemExit(1)


out_data = pandas_clean_csv('csv_files/version2_bad_headers.csv')

for i in out_data:
    print(i[0], i[1], i[2], i[3], i[4])
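
The cleaning steps above can be tried without any file on disk. The following is only a minimal sketch under stated assumptions: the sample rows, the column layout (point, easting, northing, elevation, description) and the helper name `clean_points` are hypothetical and not taken from my test files. It also converts the elevation column to numbers and uses `dropna(subset=...)` in place of collecting the index of bad rows, which is a variation on the function above, not a copy of it.

```python
import io
import string

import pandas as pd

# Hypothetical sample: a header row, a valid point, a row with a missing
# northing, a row with "-" placeholders, and a point without elevation.
SAMPLE_CSV = """\
Point,Easting,Northing,Elevation,Description
1,500000.0,2500000.0,12.5,corner
2,500010.0,,13.0,fence
3,-,2500020.0,,-
4,500030.0,2500030.0,,
"""


def clean_points(buffer):
    """Keep only the rows whose coordinate columns (1 and 2) are numeric."""
    frame = pd.read_csv(
        buffer,
        header=None,  # a header row, if present, is read as ordinary data
        skip_blank_lines=True,
        na_values=list(string.punctuation),  # lone punctuation marks -> NaN
    )
    # Header text and placeholders in the numeric columns become NaN.
    for col in (1, 2, 3):
        frame[col] = pd.to_numeric(frame[col], errors="coerce")
    # Defaults for the optional columns.
    frame[3] = frame[3].fillna(0)
    frame[4] = frame[4].fillna("-")
    # Drop every row that lacks a usable coordinate (this also removes
    # the header row, whose coordinates were coerced to NaN).
    return frame.dropna(subset=[1, 2]).reset_index(drop=True)


cleaned = clean_points(io.StringIO(SAMPLE_CSV))
for row in cleaned.itertuples(index=False):
    print(row[0], row[1], row[2], row[3], row[4])
```

With this sample, only points 1 and 4 should survive. Because the header row is dropped through the same NaN check as any other bad row, the same call should work whether or not the file has a header.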

You can download my test CSV files here.
