
Reading a large dataset from a website in pandas returns only 1,000 rows?

I'm trying to brush up on my pandas skills by playing around with the open NY Taxi data.

I want to read the data directly from the website in chunks and keep only the rows from March 2017. When I try to do that, for some reason I do not understand, only 1,000 rows get downloaded. pd.read_csv() does not seem to download the whole file; only the first 1,000 rows get processed.

How do I get the whole file processed?

I've read up on how to use pd.read_csv() to download data in chunks and then iterate over it. I've played around with the chunksize, but to no avail: I still only get 1,000 rows.

import pandas as pd

chunk_list = []

for chunk in pd.read_csv("https://data.cityofnewyork.us/resource/biws-g3hs.csv", chunksize=100000):
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"], format='%Y-%m-%d')
    chunk["tpep_dropoff_datetime"] = pd.to_datetime(chunk["tpep_dropoff_datetime"], format='%Y-%m-%d')

    # Keep only trips whose pickup happened in March 2017
    chunk_filter = chunk[(chunk["tpep_pickup_datetime"] >= "2017-03-01") & (chunk["tpep_pickup_datetime"] < "2017-04-01")]

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

df_concat = pd.concat(chunk_list, ignore_index=True)

df_concat.info()

I would expect to get the whole CSV file with its 100M+ rows. But when I call df_concat.info() on the result, I only ever get 1,000 rows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
dolocationid             1000 non-null int64
extra                    1000 non-null float64
fare_amount              1000 non-null float64
improvement_surcharge    1000 non-null float64
mta_tax                  1000 non-null float64
passenger_count          1000 non-null int64
payment_type             1000 non-null int64
pulocationid             1000 non-null int64
ratecodeid               1000 non-null int64
store_and_fwd_flag       1000 non-null object
tip_amount               1000 non-null float64
tolls_amount             1000 non-null float64
total_amount             1000 non-null float64
tpep_dropoff_datetime    1000 non-null datetime64[ns]
tpep_pickup_datetime     1000 non-null datetime64[ns]
trip_distance            1000 non-null float64
vendorid                 1000 non-null int64
dtypes: datetime64[ns](2), float64(8), int64(6), object(1)
memory usage: 132.9+ KB

Where do I have to tweak the code to process all rows?

Thanks!

The problem is not with how you are reading the file, but with the source itself.

If you manually download the file at that URL ("https://data.cityofnewyork.us/resource/biws-g3hs.csv"), you will see that it is only 1,000 rows long: /resource/ URLs are Socrata (SODA) API endpoints, and by default they return only 1,000 rows per request.
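If you want to keep using the API endpoint, it accepts $limit and $offset query parameters for paging. A minimal sketch of that approach (the parameter names are standard SODA query parameters; the page size of 50,000 is an arbitrary choice, and very large values may still be capped by the server):

import pandas as pd

base_url = "https://data.cityofnewyork.us/resource/biws-g3hs.csv"
page_size = 50000  # arbitrary page size for illustration
offset = 0
pages = []

while True:
    # $order=:id (a Socrata system field) keeps the paging order stable
    url = f"{base_url}?$limit={page_size}&$offset={offset}&$order=:id"
    page = pd.read_csv(url)
    # Past the end of the data the endpoint should return a header-only
    # CSV, which pandas reads as an empty DataFrame
    if page.empty:
        break
    pages.append(page)
    offset += page_size

df = pd.concat(pages, ignore_index=True)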

You should use this link instead:

pd.read_csv("https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD", chunksize=100000)

Or, better, download the file once and parse it locally.
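A minimal sketch of that approach, using only the standard library and pandas. The local file name yellow_tripdata_2017.csv is an arbitrary choice, and I let pd.to_datetime infer the timestamp format rather than passing format=, since the timestamp layout in the full export may differ from the API sample:

import urllib.request
import pandas as pd

# Download the full export once (it is several GB, so this takes a while),
# then parse the local copy in chunks
url = "https://data.cityofnewyork.us/api/views/biws-g3hs/rows.csv?accessType=DOWNLOAD"
local_path = "yellow_tripdata_2017.csv"  # arbitrary local file name
urllib.request.urlretrieve(url, local_path)

chunk_list = []
for chunk in pd.read_csv(local_path, chunksize=100000):
    chunk["tpep_pickup_datetime"] = pd.to_datetime(chunk["tpep_pickup_datetime"])
    mask = (chunk["tpep_pickup_datetime"] >= "2017-03-01") & (chunk["tpep_pickup_datetime"] < "2017-04-01")
    chunk_list.append(chunk[mask])

df_concat = pd.concat(chunk_list, ignore_index=True)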
