I modified a line from this post to conditionally read rows from a CSV file:
filename=r'C:\Users\Nutzer\Desktop\Projects\UK_Traffic_Data\test.csv'
df = (pd.read_csv(filename, on_bad_lines='skip')
      [lambda x: x['Accident_Index'].str.startswith('2005')])
This line works perfectly fine for a small test dataset. However, I have a big CSV file to read, and it takes a very long time; eventually the NotebookApp.iopub_data_rate_limit is reached. My question is:
can I apply the str.startswith('2005') filter while reading, so that only the matching rows are loaded in the first place? Do you have a suggestion on how to do that? Here is some example data:
The desired output is a pandas DataFrame containing the top six records.
We could initially read just the column we want to filter on, using usecols (assuming this reduces the reading overhead significantly):
# read only the filter column
df_indx = (pd.read_csv(filename, on_bad_lines='skip', usecols=['Accident_Index'])
           [lambda x: x['Accident_Index'].str.startswith('2005')])
We can then use the index of this filtered column to read the remaining columns with the skiprows and nrows parameters, since the Accident_Index values are sorted in the input file:
df_data = pd.read_csv(filename,
                      on_bad_lines='skip',
                      header=0,
                      skiprows=df_indx.index[0],  # skips the real header plus the leading non-matching rows
                      nrows=df_indx.shape[0])
df_data.columns = ['Accident_Index', 'data']
This gives the subset of the data we want. Note that an integer skiprows also discards the original header line, which is why the column names are reassigned manually afterwards.
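Assuming the file really is sorted by Accident_Index so that the matching rows form one contiguous block, the two-step read can be sketched end to end on a tiny in-memory CSV. The file contents and the two-column layout below are made up for illustration; passing skiprows a range that starts at 1 keeps the real header line, so the column names do not need to be reassigned:

```python
import io
import pandas as pd

# Hypothetical stand-in for the large CSV file (sorted by Accident_Index).
csv_text = (
    "Accident_Index,data\n"
    "200401BS00001,a\n"
    "200501BS00001,b\n"
    "200501BS00002,c\n"
    "200601BS00001,d\n"
)

# Step 1: read only the filter column and find the matching row positions.
idx = (pd.read_csv(io.StringIO(csv_text), usecols=['Accident_Index'])
       [lambda x: x['Accident_Index'].str.startswith('2005')])

# Step 2: read only the matching block of rows. Skipping file lines
# 1..idx.index[0] drops the leading non-matching data rows but keeps
# line 0, the header, so the column names survive.
subset = pd.read_csv(io.StringIO(csv_text),
                     skiprows=range(1, idx.index[0] + 1),
                     nrows=idx.shape[0])
```

With the sample data above, subset holds the two 2005 rows with the original Accident_Index and data headers intact.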