How to automatically skip rows that have non-floating values using pandas read_csv in Python?

I have thousands of .csv files that contain a massive amount of sensor data, all floating-point numbers. But some files also contain rows with date and time information, and these appear at different locations from file to file, as shown in the image below:

[image: sample file contents, with two date/time rows above the floating-point data]

In the image above, the first two rows happen to be at the beginning, but in other files they can appear elsewhere. These non-floating rows should be skipped when reading the files with the pandas read_csv function to avoid errors.

I used the skiprows parameter to skip specific rows at fixed indices, but the problem is that the unwanted rows vary in location from file to file:

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    # print(file_path)
    df_data = pd.read_csv(file_path, skiprows=[0, 1], header=None)
    print("data shape: ", df_data.shape)

My question is: how can I read only the float-valued rows and automatically skip the non-floating rows in all files?

import pandas as pd  

# read csv
df_data = pd.read_csv(path, header=None)

# parse to numeric and set invalid values to NaN
df_data = df_data.apply(pd.to_numeric, errors='coerce')

# drop rows that contain NaN values
df_data = df_data.dropna()
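
Applied to the loop over all files from the question (a sketch; all_list and path are taken from the question's own snippet), this becomes:

import os
import pandas as pd

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    df_data = pd.read_csv(file_path, header=None)
    # coerce every field to numeric; non-floating fields become NaN
    df_data = df_data.apply(pd.to_numeric, errors='coerce')
    # drop the rows that contained any non-floating field
    df_data = df_data.dropna()
    print("data shape: ", df_data.shape)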

You can give arbitrary/fixed header names while importing with pandas.read_csv, and then process the DataFrame to drop all non-float values.

Given data: [image: the sample input]

import pandas as pd

sample_csv = pd.read_csv('USERS.CSV', names=['X', 'Y', 'Z'], index_col=False)

def is_float(x):
    try:
        float(x)
    except ValueError:
        return False
    return True

sample_csv[sample_csv.applymap(is_float)].dropna()

The output: [image: the filtered result]

And finally, you can adjust the headers.
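
For example, a minimal sketch of that last step (the column names here are hypothetical placeholders, not from the original data):

# keep only the float-parseable cells, then drop incomplete rows
clean = sample_csv[sample_csv.applymap(is_float)].dropna()

# the surviving values are still strings, so convert them to floats
clean = clean.astype(float)

# adjust the headers to whatever the columns actually represent
clean.columns = ['sensor_1', 'sensor_2', 'sensor_3']  # hypothetical names

Note that on pandas 2.1 and newer, DataFrame.map replaces the deprecated applymap.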

You can load the CSV as text, remove the unwanted lines, and then build the DataFrame:

for j in range(len(all_list)):
    file_path = os.path.join(path, all_list[j])
    with open(file_path) as f:
        lines = f.readlines()
    # keep only the lines that look like data rows
    rows = [x.strip().split(',') for x in lines if '$' not in x]

    df_data = pd.DataFrame(rows, columns=['A', 'B', 'C'])
    print("data shape: ", df_data.shape)

I added if '$' not in x as a filter for the unwanted rows; replace it with a condition that suits your files if it is not adequate.
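
If no single character reliably marks the unwanted rows, one alternative (a sketch, assuming every valid data row consists entirely of float-parseable fields) is to keep only the lines whose every field parses as a float:

def is_data_row(line):
    # a line is a data row if every comma-separated field parses as a float
    try:
        for field in line.strip().split(','):
            float(field)
        return True
    except ValueError:
        return False

rows = [x.strip().split(',') for x in lines if is_data_row(x)]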

If your data doesn't contain missing values (or if they are going to be dropped anyway), you could make use of read_csv's converters option with a converter like

def conv(x):
    # return the parsed float, or None for any non-floating field
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

and a subsequent .dropna(). If genuine NAs in the data are a concern, you could return a different signal value from the converter and filter those rows out with boolean indexing.
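
Put together, a minimal sketch (assuming three columns, as in the examples above; extend the converter mapping to your files' actual column count):

import pandas as pd

# apply the converter to every column; non-floating fields become None
df_data = pd.read_csv(path, header=None,
                      converters={0: conv, 1: conv, 2: conv})

# rows that contained a non-floating field now hold None, so drop them
df_data = df_data.dropna()

# if real NaNs must be preserved, return e.g. float('inf') as the signal
# value from conv instead, and filter with boolean indexing:
# df_data = df_data[(df_data != float('inf')).all(axis=1)]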
