
Speed up this conditional row read of a CSV file in pandas?

I modified a line from this post to conditionally read rows from a CSV file:

import pandas as pd

filename = r'C:\Users\Nutzer\Desktop\Projects\UK_Traffic_Data\test.csv'

# read the whole file, then keep only the rows whose Accident_Index starts with '2005'
df = (pd.read_csv(filename, error_bad_lines=False)
        [lambda x: x['Accident_Index'].str.startswith('2005')])

This works perfectly fine for a small test dataset. However, the real CSV file is big and takes a very long time to read; eventually the NotebookApp.iopub_data_rate_limit is even reached. My questions are:

  1. Is there a way to improve this code and its performance?
  2. The records in the "Accident_Index" column are sorted. It may therefore be possible to stop reading as soon as a value is reached whose "Accident_Index" no longer starts with '2005'. Do you have a suggestion on how to do that? (One possibility is sketched below the example data.)

Here is some example data:

[screenshot of the example data]

The desired output should be a pandas dataframe containing the top six records.
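As a sketch of the early-exit idea from question 2 (not part of the original post): because "Accident_Index" is sorted, the file can be read in chunks with pandas' chunksize parameter and iteration stopped as soon as matches stop appearing. The chunk size of 100_000 is an arbitrary assumption, and error_bad_lines=False is kept only for consistency with the question (newer pandas versions use on_bad_lines='skip' instead).

parts = []
seen_match = False
# read the file piece by piece instead of all at once
for chunk in pd.read_csv(filename, error_bad_lines=False, chunksize=100_000):
    mask = chunk['Accident_Index'].str.startswith('2005')
    if mask.any():
        seen_match = True
        parts.append(chunk[mask])
    elif seen_match:
        # the column is sorted, so once the '2005' block has ended
        # no later chunk can contain matches and we can stop reading
        break

df = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()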

We could first read only the specific column we want to filter on, applying the condition above (assuming this significantly reduces the reading overhead).

# read only the column used for the mask
df_indx = (pd.read_csv(filename, error_bad_lines=False, usecols=['Accident_Index'])
           [lambda x: x['Accident_Index'].str.startswith('2005')])

We can then use the index of this filtered column to read the remaining columns with the skiprows and nrows parameters, since the matching rows form one contiguous block in the sorted input file.

# read only the contiguous block of matching rows;
# skiprows consumes the original header line, so the column names are assigned manually afterwards
df_data = (pd.read_csv(filename, error_bad_lines=False, header=0,
                       skiprows=df_indx.index[0], nrows=df_indx.shape[0]))
df_data.columns = ['Accident_Index', 'data']

This gives the subset of the data we want. Because skiprows drops the original header line, the column names are assigned manually above.

[screenshot of the resulting dataframe]
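For reference, one possible way to package this two-pass approach into a helper; the function name and the range-based skiprows are assumptions, not from the original answer. Passing a range to skiprows keeps the header row, so the column names no longer need to be assigned manually.

def read_rows_with_prefix(path, key='Accident_Index', prefix='2005'):
    # pass 1: read only the key column and locate the contiguous block of matches
    idx = (pd.read_csv(path, usecols=[key])
             [lambda x: x[key].str.startswith(prefix)])
    if idx.empty:
        return pd.DataFrame()
    # pass 2: read just that block; skipping rows 1..idx.index[0] keeps the header line
    return pd.read_csv(path,
                       skiprows=range(1, idx.index[0] + 1),
                       nrows=idx.shape[0])

subset = read_rows_with_prefix(filename)
print(subset.head(6))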
