
Is it possible to pd.read_csv data with a date parameter?

For me, working remotely means accessing big CSV files on a server, which take a long time to download to my local hard drive.

I've tried to speed this process up using a bit of Python, reading in only the particular columns I require. Ideally, however, if I could read in data for those columns only after a given date (e.g. > 2019-01-04), it would significantly reduce the amount of data.

My existing code for this will read in the total file and then apply a date filter. I'm just wondering if it's possible to apply that date filter to the reading of the file in the first place. I appreciate this might not be possible.

Example code:

import pandas as pd

fields = ['a','b','c'...]
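# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
data1 = pd.read_csv(r'SomeForeignDrive.csv', on_bad_lines='skip', usecols=fields)
data1['c'] = pd.to_datetime(data1['c'], errors='coerce')
# dropna returns a new DataFrame, so assign the result back
data1 = data1.dropna()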
data1 = data1[data1['c'] > '2019-01-04']
data1.to_csv(r'SomeLocalDrive.csv')

read_csv has no parameter that filters rows by date, but you can use the following workaround, provided the rows in the file are sorted by date in ascending order. First read only the date column and find the index of the first row you want to keep. Then read the whole file again, skipping all data rows before that index:

df = pd.read_csv('path', usecols=['date'])
df['date'] = pd.to_datetime(df['date'])
idx = df[df['date'] > '2019-01-04'].index[0]

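# Skip data rows 0..idx-1 but keep the header (file line 0);
# a plain skiprows=idx would also skip the header line
df = pd.read_csv('path', skiprows=range(1, idx + 1))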
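
For reference, here is a minimal self-contained sketch of the two-pass workaround described above. It assumes the rows are sorted by date in ascending order. The sample data, the in-memory `io.StringIO` source, and the helper name `read_from_date` are all hypothetical and stand in for the real file. The range passed to `skiprows` starts at 1 so that the header line is kept.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the remote file (sorted by date).
csv_text = """date,a,b
2019-01-02,1,10
2019-01-03,2,20
2019-01-05,3,30
2019-01-06,4,40
"""


def read_from_date(open_source, date_col, cutoff):
    """Return the rows from the first one whose date is later than `cutoff`.

    `open_source` is a zero-argument callable returning a fresh path or
    file-like object, because the file is read twice.
    """
    # Pass 1: parse only the date column to locate the first matching row.
    dates = pd.read_csv(open_source(), usecols=[date_col])
    parsed = pd.to_datetime(dates[date_col], errors="coerce")
    mask = (parsed > pd.Timestamp(cutoff)).to_numpy()
    if not mask.any():
        # Nothing is newer than the cutoff: return an empty frame with the header.
        return pd.read_csv(open_source(), nrows=0)
    idx = int(mask.argmax())  # position of the first True value
    # Pass 2: skip data rows 0..idx-1, i.e. file lines 1..idx (line 0 is the header).
    return pd.read_csv(open_source(), skiprows=range(1, idx + 1))


result = read_from_date(lambda: io.StringIO(csv_text), "date", "2019-01-04")
print(result)
```

Note that the first pass still has to scan the whole file to find the dates, so the saving comes from parsing fewer columns and rows, not from skipping the scan.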

read_csv docs:

Using this parameter (usecols) results in much faster parsing time and lower memory usage.
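
If the file is not sorted by date, one alternative (not part of the answer above) is to combine `usecols` with the `chunksize` parameter of `read_csv` and filter each chunk as it arrives. The sketch below is only an illustration under assumptions: the sample data, the helper name `read_filtered`, and the tiny chunk size are made up for the demo. It still reads every byte of the file, so it limits memory use rather than transfer time.

```python
import io

import pandas as pd

# Hypothetical sample data; column 'c' holds the dates, 'd' is not needed.
csv_text = """a,b,c,d
1,x,2019-01-03,foo
2,y,2019-01-05,bar
3,z,2019-02-01,baz
4,w,not a date,qux
"""


def read_filtered(source, fields, date_col, cutoff, chunksize=2):
    """Read `fields` from `source` in chunks, keeping rows dated after `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    kept = []
    for chunk in pd.read_csv(source, usecols=fields, chunksize=chunksize):
        # Unparseable dates become NaT, which never compares greater than the cutoff.
        chunk[date_col] = pd.to_datetime(chunk[date_col], errors="coerce")
        kept.append(chunk[chunk[date_col] > cutoff])
    return pd.concat(kept, ignore_index=True)


data = read_filtered(io.StringIO(csv_text), ["a", "b", "c"], "c", "2019-01-04")
print(data)
```

For a real file a much larger `chunksize` (for example 100000 rows) would be more sensible; the value 2 here only makes the chunking visible on four rows.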

