
Reading a large dataset from a website in pandas returns only 1,000 rows?

I'm trying to brush up on my pandas skills by playing around with the open NY Taxi data.

I want to read the data directly from the website in chunks and keep only the rows from March 2017. When I try to do that, for some reason I do not understand, only 1,000 rows get downloaded. pd.read_csv() does not seem to download the whole file; only the first 1,000 rows get processed.

How do I get the whole file processed?

I've read up on how to use pd.read_csv() to download data in chunks and then iterate over it. I've played around with the chunksize, but to no avail: I still only get 1,000 rows.

import pandas as pd

chunk_list = []

for chunk in pd.read_csv("https://data.cityofnewyork.us/resource/biws-g3hs.csv", chunksize=100000):
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"], format='%Y-%m-%d')
    chunk["tpep_dropoff_datetime"] = pd.to_datetime(chunk["tpep_dropoff_datetime"], format='%Y-%m-%d')

    # Keep only trips whose pickup happened in March 2017
    chunk_filter = chunk[(chunk["tpep_pickup_datetime"] >= "2017-03-01") & (chunk["tpep_pickup_datetime"] < "2017-04-01")]

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

df_concat = pd.concat(chunk_list, ignore_index=True)

df_concat.info()

I would expect to get the whole CSV file with its 100M+ rows. But when I call df_concat.info() on the result, I only ever get 1,000 rows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
dolocationid             1000 non-null int64
extra                    1000 non-null float64
fare_amount              1000 non-null float64
improvement_surcharge    1000 non-null float64
mta_tax                  1000 non-null float64
passenger_count          1000 non-null int64
payment_type             1000 non-null int64
pulocationid             1000 non-null int64
ratecodeid               1000 non-null int64
store_and_fwd_flag       1000 non-null object
tip_amount               1000 non-null float64
tolls_amount             1000 non-null float64
total_amount             1000 non-null float64
tpep_dropoff_datetime    1000 non-null datetime64[ns]
tpep_pickup_datetime     1000 non-null datetime64[ns]
trip_distance            1000 non-null float64
vendorid                 1000 non-null int64
dtypes: datetime64[ns](2), float64(8), int64(6), object(1)
memory usage: 132.9+ KB

Where do I have to tweak the code to process all rows?

Thanks!

The problem is not with how you are reading the file, but with the source itself.

If you manually download the file at that URL ("https://data.cityofnewyork.us/resource/biws-g3hs.csv"), you will see that it is only 1,000 rows long: /resource/ URLs are Socrata (SODA) API endpoints, and by default they return only 1,000 rows per request.
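If you want to keep using the API endpoint, it accepts $limit and $offset query parameters for paging. A minimal sketch of that approach (the parameter names are standard SODA query parameters; the page size of 50,000 is an arbitrary choice, and very large values may still be capped by the server):

import pandas as pd

base_url = "https://data.cityofnewyork.us/resource/biws-g3hs.csv"
page_size = 50000  # arbitrary page size for illustration
offset = 0
pages = []

while True:
    # $order=:id (a Socrata system field) keeps the paging order stable
    url = f"{base_url}?$limit={page_size}&$offset={offset}&$order=:id"
    page = pd.read_csv(url)
    # Past the end of the data the endpoint should return a header-only
    # CSV, which pandas reads as an empty DataFrame
    if page.empty:
        break
    pages.append(page)
    offset += page_size

df = pd.concat(pages, ignore_index=True)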

You should use this link instead:

pd.read_csv("https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD", chunksize=100000)

Or, better, download the file once and parse it locally.
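A minimal sketch of that approach, using only the standard library and pandas. The local file name yellow_tripdata_2017.csv is an arbitrary choice, and I let pd.to_datetime infer the timestamp format rather than passing format=, since the timestamp layout in the full export may differ from the API sample:

import urllib.request
import pandas as pd

# Download the full export once (it is several GB, so this takes a while),
# then parse the local copy in chunks
url = "https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD"
local_path = "yellow_tripdata_2017.csv"  # arbitrary local file name
urllib.request.urlretrieve(url, local_path)

chunk_list = []
for chunk in pd.read_csv(local_path, chunksize=100000):
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"])
    mask = (chunk["tpep_pickup_datetime"] >= "2017-03-01") & (chunk["tpep_pickup_datetime"] < "2017-04-01")
    chunk_list.append(chunk[mask])

df_concat = pd.concat(chunk_list, ignore_index=True)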
