How to automatically skip rows that have non-floating values using pandas read_csv in Python?

I have thousands of .csv files that contain a massive amount of sensor data, all floating-point numbers. But some files also contain rows with date and time information, and these appear at different locations from file to file, as shown in the image below:

[image: sample file contents, with two date/time rows above the floating-point data]

In the image above, the first two rows happen to be at the beginning, but in other files they can appear elsewhere. These non-floating rows should be skipped when reading the files with the pandas read_csv function to avoid errors.

I used the skiprows parameter to skip specific rows at fixed indices, but the problem is that the unwanted rows vary in location from file to file:

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    # print(file_path)
    df_data = pd.read_csv(file_path, skiprows=[0, 1], header=None)
    print("data shape: ", df_data.shape)

My question is: how can I read only the float-valued rows and automatically skip the non-floating rows in all files?

import pandas as pd  

# read csv
df_data = pd.read_csv(path, header=None)

# parse to numeric and set invalid values to NaN
df_data = df_data.apply(pd.to_numeric, errors='coerce')

# drop rows that contain NaN values
df_data = df_data.dropna()
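
Applied to the loop over all files from the question (a sketch; all_list and path are taken from the question's own snippet), this becomes:

import os
import pandas as pd

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    df_data = pd.read_csv(file_path, header=None)
    # coerce every field to numeric; non-floating fields become NaN
    df_data = df_data.apply(pd.to_numeric, errors='coerce')
    # drop the rows that contained any non-floating field
    df_data = df_data.dropna()
    print("data shape: ", df_data.shape)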

You can give arbitrary/fixed header names while importing with pandas.read_csv, and then process the DataFrame to drop all non-float values.

Given data: [image: the sample input]

import pandas as pd

sample_csv = pd.read_csv('USERS.CSV', names=['X', 'Y', 'Z'], index_col=False)

def is_float(x):
    try:
        float(x)
    except ValueError:
        return False
    return True

sample_csv[sample_csv.applymap(is_float)].dropna()

The output: [image: the filtered result]

And finally, you can adjust the headers.
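
For example, a minimal sketch of that last step (the column names here are hypothetical placeholders, not from the original data):

# keep only the float-parseable cells, then drop incomplete rows
clean = sample_csv[sample_csv.applymap(is_float)].dropna()

# the surviving values are still strings, so convert them to floats
clean = clean.astype(float)

# adjust the headers to whatever the columns actually represent
clean.columns = ['sensor_1', 'sensor_2', 'sensor_3']  # hypothetical names

Note that on pandas 2.1 and newer, DataFrame.map replaces the deprecated applymap.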

You can load the CSV as text, remove the unwanted lines, and then build the DataFrame:

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    with open(file_path) as f:
        lines = f.readlines()
    # keep only the lines that look like data rows
    rows = [x.strip().split(',') for x in lines if '$' not in x]

    df_data = pd.DataFrame(rows, columns=['A', 'B', 'C'])
    print("data shape: ", df_data.shape)

I added if '$' not in x as a filter for the unwanted rows; replace it with a condition that suits your files if it is not adequate.
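
If no single character reliably marks the unwanted rows, one alternative (a sketch, assuming every valid data row consists entirely of float-parseable fields) is to keep only the lines whose every field parses as a float:

def is_data_row(line):
    # a line is a data row if every comma-separated field parses as a float
    try:
        for field in line.strip().split(','):
            float(field)
        return True
    except ValueError:
        return False

rows = [x.strip().split(',') for x in lines if is_data_row(x)]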

If your data doesn't contain missing values (or if they are going to be dropped anyway), you could make use of read_csv's converters option with a converter like

def conv(x):
    # return the parsed float, or None for any non-floating field
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

and a subsequent .dropna(). If genuine NAs in the data are a concern, you could return a different signal value from the converter and filter those rows out with boolean indexing.
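
Put together, a minimal sketch (assuming three columns, as in the examples above; extend the converter mapping to your files' actual column count):

import pandas as pd

# apply the converter to every column; non-floating fields become None
df_data = pd.read_csv(path, header=None,
                      converters={0: conv, 1: conv, 2: conv})

# rows that contained a non-floating field now hold None, so drop them
df_data = df_data.dropna()

# if real NaNs must be preserved, return e.g. float('inf') as the signal
# value from conv instead, and filter with boolean indexing:
# df_data = df_data[(df_data != float('inf')).all(axis=1)]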
