How to automatically skip rows that have non-floating values using pandas read_csv in python?

I have thousands of .csv files that contain a massive amount of sensor data, all as floating-point numbers. However, some rows in some files contain date and time information, and these rows appear at different locations in the files, as shown in the image below:

[image: enter image description here]

In the above image, the first two rows are at the beginning, but in other files they can be at other locations. Such non-float rows should be skipped when reading the files with the pandas read_csv function to avoid errors.

I used the skiprows parameter to skip specific rows at fixed row indices, but the problem is that the location of the unwanted rows varies from file to file.

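import os
import pandas as pd

for j in range(len(all_list)):
    # build the file path in its own variable so the folder `path` is not overwritten
    file_path = os.path.join(path, all_list[j])
    # print(file_path)
    df_data = pd.read_csv(file_path, skiprows=[0, 1], header=None)
    print("data shape: ", df_data.shape)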

My question is: How can I read only the rows that contain floats and automatically skip the non-float rows in all files?

import pandas as pd  

# read csv
df_data = pd.read_csv(path, header=None)

# parse to numeric and set invalid values to NaN
df_data = df_data.apply(pd.to_numeric, errors='coerce')

# drop rows that contain NaN values
df_data = df_data.dropna()
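
Since the question is about many files, the three steps above could be wrapped in a small helper. This is only a sketch: the function name read_float_rows and the in-memory sample data are made up for illustration, and it assumes that every column of a valid row should be numeric.

```python
import io

import pandas as pd


def read_float_rows(source):
    """Read a CSV and keep only the rows in which every cell is numeric.

    `source` can be a file path or any file-like object accepted by read_csv.
    """
    raw = pd.read_csv(source, header=None)
    # non-numeric cells (dates, times, labels) become NaN
    numeric = raw.apply(pd.to_numeric, errors="coerce")
    # drop every row that contains at least one NaN
    return numeric.dropna().reset_index(drop=True)


# made-up sample: date/time rows at the start and in the middle
SAMPLE = (
    "2021-03-04,10:15:00,start\n"
    "0.12,1.50,-3.20\n"
    "0.15,1.47,-3.18\n"
    "2021-03-04,10:15:01,mark\n"
    "0.11,1.52,-3.25\n"
)

df_demo = read_float_rows(io.StringIO(SAMPLE))
print("data shape: ", df_demo.shape)
```

In the loop from the question, the same helper could be called with each file path instead of the StringIO object. Note that rows with genuinely missing values are dropped as well.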

You can give arbitrary/fixed header names when importing with pandas.read_csv and then process the data frame to drop all non-float values.

Given data: [image: enter image description here]

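import pandas as pd

sample_csv = pd.read_csv('USERS.CSV', names=['X', 'Y', 'Z'], index_col=False)

def is_float(x):
    # True if x can be converted to a float
    try:
        float(x)
    except ValueError:
        return False
    return True

# cells that do not parse as float become NaN, then incomplete rows are dropped
# (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
sample_csv[sample_csv.applymap(is_float)].dropna()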

The output: [image: enter image description here]

And finally, you can adjust the headers.
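
A self-contained sketch of the same idea, including the final header adjustment, might look like the following. The sample data and the new column names are invented for illustration, and reading with dtype=str is my own assumption so that every cell starts out as text. The remaining values are also cast to float at the end, because after filtering they would otherwise still be strings.

```python
import io

import pandas as pd


def is_float(x):
    """Return True if x can be converted to a float."""
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True


# made-up sample with two non-float rows
SAMPLE = (
    "2021-03-04,10:15:00,start\n"
    "0.12,1.50,-3.20\n"
    "0.15,1.47,-3.18\n"
    "2021-03-04,10:15:01,mark\n"
    "0.11,1.52,-3.25\n"
)

# read everything as text under placeholder header names
raw = pd.read_csv(io.StringIO(SAMPLE), names=["X", "Y", "Z"],
                  index_col=False, dtype=str)

# boolean frame: True where the cell parses as a float
mask = raw.apply(lambda col: col.map(is_float))

# keep rows in which every cell is a float, then convert text to numbers
clean = raw[mask.all(axis=1)].astype(float).reset_index(drop=True)

# adjust the headers (hypothetical names)
clean.columns = ["acc_x", "acc_y", "acc_z"]
print(clean)
```

Using Series.map per column instead of applymap is just a choice to avoid the deprecation warning in recent pandas versions.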

You can load the csv as text, remove the unwanted lines and read it with pandas:

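import os
import pandas as pd

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    # print(file_path)
    with open(file_path) as f:
        lines = f.readlines()
    # drop lines containing '$', strip the newline and split into fields
    rows = [x.rstrip('\n').split(',') for x in lines if '$' not in x]

    # note: the values are still strings at this point
    df_data = pd.DataFrame(rows, columns=['A', 'B', 'C'])
    print("data shape: ", df_data.shape)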

I added if '$' not in x as a filter for unwanted rows; you need to replace it if it is not adequate for your files.
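
If no single marker character identifies the unwanted rows, one possible replacement filter is "every field on the line parses as a float". The sketch below assumes comma-separated files; the helper names is_float_line and read_float_lines and the sample data are made up. Because the bad lines are removed before pandas sees them, lines with a different number of fields are not a problem either.

```python
import io

import pandas as pd


def is_float_line(line, sep=","):
    """Return True if every field on the line can be converted to float."""
    try:
        for field in line.strip().split(sep):
            float(field)
    except ValueError:
        return False
    return True


def read_float_lines(handle, sep=","):
    """Keep only the all-float lines of an open text file and parse them."""
    kept = [line for line in handle if is_float_line(line, sep)]
    # read_csv raises EmptyDataError if no line survived the filter
    return pd.read_csv(io.StringIO("".join(kept)), header=None, sep=sep)


# made-up sample: the unwanted lines have only two fields
SAMPLE = (
    "$TIME,2021-03-04 10:15:00\n"
    "0.12,1.50,-3.20\n"
    "0.15,1.47,-3.18\n"
    "2021-03-04,10:15:01\n"
    "0.11,1.52,-3.25\n"
)

df_lines = read_float_lines(io.StringIO(SAMPLE))
print("data shape: ", df_lines.shape)
```

With real files, the handle would come from open(file_path) inside the loop. Blank lines are dropped too, since an empty string does not parse as a float.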

If your data doesn't contain missing values (or if they are going to be dropped anyway), you could make use of read_csv's converters functionality with a converter like

def conv(x):
    # converters receive the raw cell text; anything that is not a float becomes None
    try:
        return float(x)
    except ValueError:
        return None

and a subsequent .dropna(). If NAs are a concern, you could return a different sentinel value instead and filter those rows out via boolean indexing.
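
To make both variants concrete, here is a rough sketch. It assumes a file with three columns and no header; N_COLS, the sentinel string BAD, the function names and the sample data are all made up for illustration. The sample contains one genuinely missing value so the difference between the two variants is visible.

```python
import io
import math

import pandas as pd

N_COLS = 3          # assumed number of columns
BAD = "__bad__"     # hypothetical sentinel for non-float cells


def conv_none(x):
    """Simple variant: anything that is not a float becomes None."""
    try:
        return float(x)
    except ValueError:
        return None


def conv_sentinel(x):
    """Sentinel variant: empty cells stay NaN, other non-floats are marked."""
    if x.strip() == "":
        return math.nan
    try:
        return float(x)
    except ValueError:
        return BAD


SAMPLE = (
    "2021-03-04,10:15:00,start\n"
    "0.12,1.50,-3.20\n"
    "0.15,,-3.18\n"
    "2021-03-04,10:15:01,mark\n"
    "0.11,1.52,-3.25\n"
)

# variant 1: converter plus dropna (also drops the row with the empty cell)
simple = pd.read_csv(io.StringIO(SAMPLE), header=None,
                     converters={i: conv_none for i in range(N_COLS)}).dropna()

# variant 2: mark bad cells, drop only the rows that contain the sentinel
marked = pd.read_csv(io.StringIO(SAMPLE), header=None,
                     converters={i: conv_sentinel for i in range(N_COLS)})
is_bad = (marked == BAD).any(axis=1)
clean = marked[~is_bad].astype(float).reset_index(drop=True)

print(len(simple), len(clean))
```

The sentinel variant keeps the row with the empty cell as a NaN, while the simple variant drops it together with the date/time rows.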
