
Pandas read_csv(): drop rows that don't match a schema

I have a CSV file that I need to read and parse as a Pandas DataFrame. In theory, all the columns should follow a known schema of numerical data and strings, but I know that some records are broken, either with fewer fields than expected or with fields in the wrong order.

I would like to get rid of all these problematic rows.

As a reference, in PySpark I used the 'DROPMALFORMED' mode to filter out records that don't match the schema:

from pyspark.sql.types import StructType, StructField, LongType, StringType

dataSchema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", StringType(), True)])

dataFrame = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='false', delimiter='\t', mode='DROPMALFORMED') \
    .load(filename, schema=dataSchema)

With Pandas, I cannot find a simple way to do this. For instance, I thought this snippet would do the trick, but instead it just copies the wrong value back instead of dropping it:

import numpy as np

dataFrame['col1'] = dataFrame['col1'].astype(np.int64, errors='ignore')
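A minimal reproduction of that behaviour (the Series here is made up for illustration): in pandas versions that still accept errors='ignore', astype hands back the original object untouched whenever any value fails the cast:

import numpy as np
import pandas as pd

s = pd.Series([1, 'F', 3])
# the cast of 'F' fails, so the whole original Series is returned
# unchanged; nothing is dropped or converted
print(s.astype(np.int64, errors='ignore'))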

Maybe pandas.to_numeric will help. It has an errors='coerce' option, which replaces all wrong values with NaN. Then, you can use the dropna() function to remove rows containing NaN:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 'F', 8]],
                  columns=['col1', 'col2', 'col3'])
df['col2'] = pd.to_numeric(df['col2'], errors='coerce')  # 'F' becomes NaN
df.dropna(inplace=True)  # drops the row containing NaN
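If some rows also have the wrong number of fields, the same coerce-and-drop idea can be combined with read_csv's on_bad_lines='skip' option (available since pandas 1.3). A minimal sketch, assuming a tab-separated file; the file name and schema dict here are hypothetical, modeled on the question:

import pandas as pd

# hypothetical file and schema: column name -> target dtype
schema = {'col1': 'int64', 'col2': 'string'}

# on_bad_lines='skip' (pandas >= 1.3) drops rows with too many fields,
# similar to Spark's DROPMALFORMED; rows with too few fields come
# through padded with NaN and are removed by dropna() below
df = pd.read_csv('data.tsv', sep='\t', header=None,
                 names=list(schema), on_bad_lines='skip')

# coerce the numeric columns; values that don't parse become NaN
for col, dtype in schema.items():
    if dtype != 'string':
        df[col] = pd.to_numeric(df[col], errors='coerce')

# drop rows where any coercion failed, then restore the target dtypes
df = df.dropna().astype(schema)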
