
Speed up this conditional row read of a CSV file in pandas?

I modified a line from this post to conditionally read rows from a CSV file:

import pandas as pd

filename = r'C:\Users\Nutzer\Desktop\Projects\UK_Traffic_Data\test.csv'

# read the whole file, then keep only the rows whose Accident_Index starts with '2005'
df = (pd.read_csv(filename, error_bad_lines=False)
        [lambda x: x['Accident_Index'].str.startswith('2005')])

This works perfectly fine for a small test dataset. However, the real CSV file is big and takes a very long time to read; eventually the NotebookApp.iopub_data_rate_limit is even reached. My questions are:

  1. Is there a way to improve this code and its performance?
  2. The records in the "Accident_Index" column are sorted. It may therefore be possible to stop reading as soon as a value is reached whose "Accident_Index" no longer starts with '2005'. Do you have a suggestion on how to do that? (One possibility is sketched below the example data.)

Here is some example data:

[screenshot of the example data]

The desired output should be a pandas dataframe containing the top six records.
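As a sketch of the early-exit idea from question 2 (not part of the original post): because "Accident_Index" is sorted, the file can be read in chunks with pandas' chunksize parameter and iteration stopped as soon as matches stop appearing. The chunk size of 100_000 is an arbitrary assumption, and error_bad_lines=False is kept only for consistency with the question (newer pandas versions use on_bad_lines='skip' instead).

parts = []
seen_match = False
# read the file piece by piece instead of all at once
for chunk in pd.read_csv(filename, error_bad_lines=False, chunksize=100_000):
    mask = chunk['Accident_Index'].str.startswith('2005')
    if mask.any():
        seen_match = True
        parts.append(chunk[mask])
    elif seen_match:
        # the column is sorted, so once the '2005' block has ended
        # no later chunk can contain matches and we can stop reading
        break

df = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()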

We could first read only the specific column we want to filter on, applying the condition above (assuming this significantly reduces the reading overhead).

# read only the column used for the mask
df_indx = (pd.read_csv(filename, error_bad_lines=False, usecols=['Accident_Index'])
           [lambda x: x['Accident_Index'].str.startswith('2005')])

We can then use the index of this filtered column to read the remaining columns with the skiprows and nrows parameters, since the matching rows form one contiguous block in the sorted input file.

# read only the contiguous block of matching rows;
# skiprows consumes the original header line, so the column names are assigned manually afterwards
df_data = (pd.read_csv(filename, error_bad_lines=False, header=0,
                       skiprows=df_indx.index[0], nrows=df_indx.shape[0]))
df_data.columns = ['Accident_Index', 'data']

This gives the subset of the data we want. Because skiprows drops the original header line, the column names are assigned manually above.

[screenshot of the resulting dataframe]
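For reference, one possible way to package this two-pass approach into a helper; the function name and the range-based skiprows are assumptions, not from the original answer. Passing a range to skiprows keeps the header row, so the column names no longer need to be assigned manually.

def read_rows_with_prefix(path, key='Accident_Index', prefix='2005'):
    # pass 1: read only the key column and locate the contiguous block of matches
    idx = (pd.read_csv(path, usecols=[key])
             [lambda x: x[key].str.startswith(prefix)])
    if idx.empty:
        return pd.DataFrame()
    # pass 2: read just that block; skipping rows 1..idx.index[0] keeps the header line
    return pd.read_csv(path,
                       skiprows=range(1, idx.index[0] + 1),
                       nrows=idx.shape[0])

subset = read_rows_with_prefix(filename)
print(subset.head(6))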
