
Select Dataframe rows in a date range

I have a DataFrame like the following:

   transaction_no  sales_order  is_delivered dispatch_date  remarks  ....

0          2122.0          1.0          True    06-01-2020      NaN   
1          2122.0          1.0          True    06-01-2020      NaN   
2          2122.0          1.0          True    06-01-2020      NaN   
3          2122.0          1.0          True    06-01-2020      NaN   
4          2122.0          1.0          True    06-01-2020      NaN   

I want to select rows that fall within a date range, but I get an empty DataFrame every time.

Here's what I did:

        dt_format = '%Y-%m-%d %H:%M'  

        o_f = datetime.strptime(request.GET['from'], dt_format).strftime('%d/%m/%Y')
        o_t = datetime.strptime(request.GET['to'], dt_format).strftime('%d/%m/%Y')

        f = datetime.strptime(request.GET['from'], dt_format).replace(tzinfo=pytz.UTC).date().strftime("%d-%m-%Y")
        t = datetime.strptime(request.GET['to'], dt_format).replace(tzinfo=pytz.UTC).date().strftime("%d-%m-%Y")


        allot_df = allot_df[allot_df['dispatch_date'].isin(pd.date_range(f, t))]

How can I do that? Better yet, why is this not working?

Update: The column's type was str, so I converted it to datetime:

    allot_df['dispatch_date'] = pd.to_datetime(allot_df['dispatch_date'])
    allot_df = allot_df[allot_df['dispatch_date'].isin(pd.date_range(f, t))]

But now the whole DataFrame comes back as the output.

Assume that just after reading (e.g. by calling pd.read_csv), without any type conversion, your DataFrame contains:

   transaction_no  sales_order  is_delivered dispatch_date
0          2122.0          1.0          True    06-01-2020
1          2123.0          1.0          True    07-01-2020
2          2124.0          1.0          True    08-01-2020
3          2125.0          1.0          True    09-01-2020
4          2126.0          1.0          True    10-01-2020

To check the column types, run df.info(). The result should look something like:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   transaction_no  5 non-null      float64
 1   sales_order     5 non-null      float64
 2   is_delivered    5 non-null      bool   
 3   dispatch_date   5 non-null      object 
dtypes: bool(1), float64(2), object(1)
memory usage: 165.0+ bytes

Note the Dtype of the dispatch_date column. It is object, which means something other than a number, and in this case a string.
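As a rough illustration of why the first attempt in the question returned nothing, the sketch below rebuilds the sample table by hand (the January date range is an arbitrary choice, not taken from the question) and compares the string column against a date range. Strings are never equal to the Timestamp values that pd.date_range produces, so the mask comes out empty.

```python
import pandas as pd

# Sample data mirroring the table above; dispatch_date is still plain text.
df = pd.DataFrame({
    "transaction_no": [2122.0, 2123.0, 2124.0, 2125.0, 2126.0],
    "sales_order": [1.0] * 5,
    "is_delivered": [True] * 5,
    "dispatch_date": ["06-01-2020", "07-01-2020", "08-01-2020",
                      "09-01-2020", "10-01-2020"],
})

# False here means the column does not hold real datetimes yet.
is_datetime = pd.api.types.is_datetime64_any_dtype(df["dispatch_date"])

# The strings are compared against Timestamp objects, so nothing matches
# and the resulting mask selects no rows.
mask = df["dispatch_date"].isin(pd.date_range("2020-01-01", "2020-01-31"))
```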

A good habit when working with Pandas objects is to use the native Pandas datetime type rather than the datetime module. Vectorized operations on a datetime64 column typically run much faster than code that handles Python datetime objects or strings one by one.

So the first step is to convert the dispatch_date column from string to datetime. You can do it by calling:

df.dispatch_date = pd.to_datetime(df.dispatch_date, dayfirst=True)

Now when you print df, you will get:

   transaction_no  sales_order  is_delivered dispatch_date
0          2122.0          1.0          True    2020-01-06
1          2123.0          1.0          True    2020-01-07
2          2124.0          1.0          True    2020-01-08
3          2125.0          1.0          True    2020-01-09
4          2126.0          1.0          True    2020-01-10

The first thing to notice is that dispatch_date is now printed in year-month-day format, but that alone does not confirm its type. To check it, run df.info() again. The row for dispatch_date should be:

3   dispatch_date   5 non-null      datetime64[ns]
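The dayfirst=True argument matters for dates written as day-month-year. The following minimal sketch (using a two-element Series invented for illustration) shows how the same text is read with and without it. Ambiguous strings parsed month-first are one likely reason the filter in the question's update did not behave as expected.

```python
import pandas as pd

raw = pd.Series(["06-01-2020", "07-01-2020"])

# Default parsing treats the first number as the month when it can.
month_first = pd.to_datetime(raw)

# dayfirst=True reads the same text as day-month-year.
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the guesswork altogether.
explicit = pd.to_datetime(raw, format="%d-%m-%Y")
```

Here month_first starts at 1 June 2020, while day_first and explicit both start at 6 January 2020.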

If you want to retrieve the rows for a particular date range, you can, for example:

  • specify both border dates as strings, also in year-month-day format,
  • call df.query, passing both dates in the query string.

Something like:

df.query("dispatch_date.between('2020-01-07', '2020-01-09')")

The result is:

   transaction_no  sales_order  is_delivered dispatch_date
1          2123.0          1.0          True    2020-01-07
2          2124.0          1.0          True    2020-01-08
3          2125.0          1.0          True    2020-01-09

Note that the end date is inclusive, unlike position-based slices (e.g. df.iloc[1:3] or ordinary Python slicing), where the right border is exclusive.
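If you prefer not to build a query string, the same selection can be written as a boolean mask. This is only a sketch on a cut-down version of the sample data; the names start, end and selected are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_no": [2122.0, 2123.0, 2124.0, 2125.0, 2126.0],
    "dispatch_date": pd.to_datetime(
        ["06-01-2020", "07-01-2020", "08-01-2020", "09-01-2020", "10-01-2020"],
        format="%d-%m-%Y",
    ),
})

start = pd.Timestamp("2020-01-07")
end = pd.Timestamp("2020-01-09")

# Series.between includes both endpoints by default.
selected = df[df["dispatch_date"].between(start, end)]
```

An equivalent mask is (df["dispatch_date"] >= start) & (df["dispatch_date"] <= end), which makes the inclusive borders explicit.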

I deliberately did not go into details such as how to extract the two date strings from your source data. That is a separate issue, which you should be able to handle on your own.
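For what it is worth, one possible way to obtain the two bounds is sketched below. The values of raw_from and raw_to are hypothetical stand-ins for request.GET['from'] and request.GET['to'], and the format string is the one shown in the question. Dropping the time of day is an assumption made here so that whole days are compared.

```python
import pandas as pd

# Hypothetical stand-ins for request.GET['from'] and request.GET['to'].
raw_from = "2020-01-07 00:00"
raw_to = "2020-01-09 18:30"

dt_format = "%Y-%m-%d %H:%M"  # format string taken from the question

# Parse to Timestamps, then reset the time to midnight so that the
# comparison works on whole days.
start = pd.to_datetime(raw_from, format=dt_format).normalize()
end = pd.to_datetime(raw_to, format=dt_format).normalize()
```

The resulting start and end are Timestamp objects that can be passed directly to between on a datetime64 column.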


 