I'm trying to brush up my pandas skills by playing around with the open NYC Taxi data.
I want to get the data directly from the website in chunks and keep only the rows from March 2017. When I try that, for some reason I don't understand, only 1,000 rows get downloaded: pd.read_csv() does not seem to download the whole file, and only the first 1,000 rows get processed.
How do I get the whole file processed?
I've read up on how to use pd.read_csv() to download data in chunks and then iterate over it, and I've played around with the chunksize, but to no avail. I still only get 1,000 rows.
chunk_list = []
for chunk in pd.read_csv("https://data.cityofnewyork.us/resource/biws-g3hs.csv", chunksize=100000):
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"], format='%Y-%m-%d')
    chunk["tpep_dropoff_datetime"] = pd.to_datetime(chunk["tpep_dropoff_datetime"], format='%Y-%m-%d')
    chunk_filter = chunk[(chunk["tpep_pickup_datetime"] >= "2017-03-01") & (chunk["tpep_pickup_datetime"] < "2017-04-01")]
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)
df_concat = pd.concat(chunk_list, ignore_index=True)
df_concat.info()
I would expect to access the whole CSV file with its 100M+ rows, but when I call df_concat.info() on the result, I only ever get 1,000 rows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
dolocationid 1000 non-null int64
extra 1000 non-null float64
fare_amount 1000 non-null float64
improvement_surcharge 1000 non-null float64
mta_tax 1000 non-null float64
passenger_count 1000 non-null int64
payment_type 1000 non-null int64
pulocationid 1000 non-null int64
ratecodeid 1000 non-null int64
store_and_fwd_flag 1000 non-null object
tip_amount 1000 non-null float64
tolls_amount 1000 non-null float64
total_amount 1000 non-null float64
tpep_dropoff_datetime 1000 non-null datetime64[ns]
tpep_pickup_datetime 1000 non-null datetime64[ns]
trip_distance 1000 non-null float64
vendorid 1000 non-null int64
dtypes: datetime64[ns](2), float64(8), int64(6), object(1)
memory usage: 132.9+ KB
Where do I have to tweak the code to process all rows?
Thanks!
The problem is not with the reading, but with the source.
If you download the file at "https://data.cityofnewyork.us/resource/biws-g3hs.csv" manually, you'll see it is only 1,000 rows long — that resource endpoint returns a truncated sample.
You should use this link instead:
pd.read_csv("https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD", chunksize=100000)
or, better, download the file and parse it locally.
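To make that concrete, here is a minimal sketch of the chunked read against the full-export URL, with the March-2017 filter from the question factored into a function. This assumes the export endpoint streams the complete file and that the column names match the question's output; the `filter_march_2017` helper is my own naming, not part of the dataset's API.

```python
import pandas as pd

# Full-export endpoint (not the /resource/ endpoint, which only returns a sample).
URL = "https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD"

def filter_march_2017(chunks):
    """Keep only rows whose pickup falls in March 2017, one chunk at a time."""
    kept = []
    for chunk in chunks:
        chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"])
        mask = (chunk["tpep_pickup_datetime"] >= "2017-03-01") & \
               (chunk["tpep_pickup_datetime"] < "2017-04-01")
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# df = filter_march_2017(pd.read_csv(URL, chunksize=100_000))
# df.info()
```

Note that `format='%Y-%m-%d'` is dropped here: the pickup timestamps include a time component, so letting `pd.to_datetime` infer the format is safer. Keeping only the filtered rows per chunk is what keeps memory bounded while the full file streams through.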