
Python Pandas DataFrame: filter by a Timestamp column with a list of string timestamps

Example setup:

import pandas as pd
df = pd.DataFrame(
    data={'ts':
          [
                '2008-11-05 07:45:23.100',
                '2008-11-17 06:53:25.150',
                '2008-12-02 07:36:18.643',
                '2008-12-15 07:36:24.837',
                '2009-01-06 07:03:47.387',
          ], 
          'val': range(5)})

df.ts = pd.to_datetime(df.ts)

df.set_index('ts', drop=False, inplace=True)

df


                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-11-17 06:53:25.150 | 2008-11-17 06:53:25.150 | 1
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
2009-01-06 07:03:47.387 | 2009-01-06 07:03:47.387 | 4

Although the index is a DatetimeIndex (its elements are pd.Timestamp values), I can filter it with a string representation of a timestamp. For example:

df.loc['2008-11-05']

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0

Moreover, pandas has a very convenient feature (partial string indexing): when the string is less precise than the index, for example only a year and month, it returns every row that falls within that period. For example:

df.loc['2008-12']

                        | ts                      | val
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
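
As a rough illustration of what the partial-string lookup does, the sketch below compares it with an explicit range filter over the same month. The frame is rebuilt here so the snippet runs on its own, and the names by_label and by_range are only for illustration.

```python
import pandas as pd

# Rebuild the example data with the timestamps as the index.
stamps = pd.to_datetime([
    '2008-11-05 07:45:23.100',
    '2008-11-17 06:53:25.150',
    '2008-12-02 07:36:18.643',
    '2008-12-15 07:36:24.837',
    '2009-01-06 07:03:47.387',
])
df = pd.DataFrame({'val': list(range(5))}, index=stamps)

# Partial-string lookup: every row that falls inside December 2008.
by_label = df.loc['2008-12']

# The same selection written as an explicit half-open range.
by_range = df[(df.index >= '2008-12-01') & (df.index < '2009-01-01')]

print(by_label['val'].tolist(), by_range['val'].tolist())
```

Both selections should contain the same two December rows.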

My first question: how can I filter df with a list of string timestamps? For example, if I run the code below

df.loc[['2008-11-05','2008-12']]

the result I want is

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3

but in fact I get the following error:

KeyError: "None of [Index(['2008-11-05', '2008-12'], dtype='object', name='ts')] are in the [index]"

My second question: can I apply similar filtering logic to a regular column? That is, if I don't set ts as the index, can I filter the ts column directly with a string?

-------------------- Follow up 2019-9-10 10:00 --------------------

All the answers below are very much appreciated. I didn't know that pd.Series.str.startswith accepts a tuple of multiple strings, or that pd.Series.str.contains supports '|' as a regex OR. New skills learned!

I think all the methods based on astype(str) have one major shortcoming for me: in the US, people use all kinds of date and time formats. Besides '2008-11-05', commonly used ones in my company are '2008-11-5', '11/05/2008', '11/5/2008', '20081105', and '05nov2008', all of which would fail with a string-based method.

For now I still have to stick with the following method, which requires the column to be the index and doesn't seem efficient (I haven't profiled it), but should be sufficiently robust. I don't understand why pandas doesn't support this natively.

L = ['5nov2008','2008/12']
pd.concat([df.loc[val] for val in L]).drop_duplicates()

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
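
One way to get similar behaviour on a regular column, without going through strings, is to turn each filter value into a time span and test the timestamps against it. The following is only a sketch: period_mask is a hypothetical helper (not part of pandas), and it assumes each filter string is something pd.Period can parse, as the ISO-style strings used here are. Whether the other formats mentioned above parse the same way has not been checked here.

```python
import pandas as pd

def period_mask(values, labels):
    """Hypothetical helper: True where a timestamp falls inside any label's period."""
    ts = pd.Series(pd.to_datetime(values)).reset_index(drop=True)
    mask = pd.Series(False, index=ts.index)
    for label in labels:
        # '2008-12' becomes a monthly period, '2008-11-05' a daily one.
        period = pd.Period(label)
        mask |= ts.between(period.start_time, period.end_time)
    # Return a plain array so it can index a frame regardless of its index.
    return mask.to_numpy()

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': list(range(5)),
})

L = ['2008-11-05', '2008-12']
result = df[period_mask(df['ts'], L)]
print(result)
```

Because the helper returns a positional boolean array, the same call should also work with df.index as the first argument when ts is the index.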

You can use .str.contains() after first converting the index to str:

res = df.loc[(df.index.astype(str).str.contains("2008-12")) 
             | (df.index.astype(str).str.contains('2008-11-05'))]
print(res)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

For the second question: yes, you can apply the same filter to a regular column, like this:

df.loc[(df.ts.astype(str).str.contains("2008-12"))
    |(df.ts.astype(str).str.contains('2008-11-05'))]

This should get you going.

>>> df
                       ts  val
0 2008-11-05 07:45:23.100    0
1 2008-11-17 06:53:25.150    1
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3
4 2009-01-06 07:03:47.387    4

Result:

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05')).any(axis=1)]
                       ts  val
0 2008-11-05 07:45:23.100    0

Or, with ts set as the index:

>>> df
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-11-17 06:53:25.150 2008-11-17 06:53:25.150    1
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3
2009-01-06 07:03:47.387 2009-01-06 07:03:47.387    4

Result:

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05')).any(axis=1)]
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0

Looking for multiple values.

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05|2008-12')).any(axis=1)]
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

For your first question, you could use pd.concat (the original suggestion, pd.DataFrame.append, was deprecated in pandas 1.4 and removed in 2.0):

pd.concat([df.loc['2008-11-05'], df.loc['2008-12']])

#                                              ts  val
# ts                                                  
# 2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
# 2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
# 2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

For your second question, you could use pd.Series.str.match, which anchors the pattern at the start of each string:

df.ts.astype(str).str.match('2008-11-05|2008-12')

# ts
# 2008-11-05 07:45:23.100     True
# 2008-11-17 06:53:25.150    False
# 2008-12-02 07:36:18.643     True
# 2008-12-15 07:36:24.837     True
# 2009-01-06 07:03:47.387    False
# Name: ts, dtype: bool

This can then be used as a boolean index:

df[df.ts.astype(str).str.match('2008-11-05|2008-12')]

#                                              ts  val
# ts                                                  
# 2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
# 2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
# 2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

Note that you can leave out the astype(str) part if your ts column is already of type string.

The first idea is simply to join the two selections together with concat:

df1 = pd.concat([df.loc['2008-11-05'], df.loc['2008-12']], sort=True)
print (df1)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

Or filter by boolean indexing, with a mask from str.contains that uses | as the regex OR:

df1 = df[df.index.astype(str).str.contains('2008-11-05|2008-12')]

Or with Series.str.startswith and tuple:

df1 = df[df.index.astype(str).str.startswith(('2008-11-05', '2008-12'))]
print (df1)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

If the input is a list of strings:

L = ['2008-11-05','2008-12']

df2 = df[df.ts.astype(str).str.contains('|'.join(L))]

And similarly:

df2 = df[df.ts.astype(str).str.startswith(tuple(L))]
print (df2)
                       ts  val
0 2008-11-05 07:45:23.100    0
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3
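
One caveat when joining a list with '|': str.contains treats the result as a regular expression by default, so a character such as '.' in a filter string would act as a wildcard. A minimal sketch of a safer variant, assuming the filter strings are meant literally, escapes each one with re.escape first. The variable names here are only for illustration.

```python
import re
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2008-11-05 07:45:23.100',
    '2008-11-17 06:53:25.150',
    '2008-12-02 07:36:18.643',
    '2008-12-15 07:36:24.837',
    '2009-01-06 07:03:47.387',
]))

L = ['2008-11-05', '2008-12']

# Escape each literal before joining, so only '|' acts as a regex operator.
pattern = '|'.join(re.escape(s) for s in L)
mask = ts.astype(str).str.contains(pattern)
print(mask.tolist())
```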

To filter on the column instead, only change df.index to df.ts:

df2 = df[df.ts.astype(str).str.contains('2008-11-05|2008-12')]

Or:

df2 = df[df.ts.astype(str).str.startswith(('2008-11-05', '2008-12'))]
print (df2)
                       ts  val
0 2008-11-05 07:45:23.100    0
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3

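You seem to have stumbled upon what looks like a bug, although it may be a deliberate limitation: a list passed to .loc is matched against exact index labels, so the partial-string matching that works for a single string or a slice does not apply.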

This works

df.loc['2008-11-05']

This works

df.loc['2008-11-05':'2008-12-15']

but this doesn't, as you mentioned.

df.loc[['2008-11-05','2008-12-15']]

However, you can select the rows you want by position, as below.

df.iloc[[0,2,3]]
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3
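
If hard-coding positions is not practical, one alternative for a list of whole days (as in the failing example above) is to strip the time of day from the index and test membership. This is only a sketch under that assumption; it does not handle month-level strings such as '2008-12', and the variable names are illustrative.

```python
import pandas as pd

stamps = pd.to_datetime([
    '2008-11-05 07:45:23.100',
    '2008-11-17 06:53:25.150',
    '2008-12-02 07:36:18.643',
    '2008-12-15 07:36:24.837',
    '2009-01-06 07:03:47.387',
])
df = pd.DataFrame({'val': list(range(5))}, index=stamps)

days = pd.to_datetime(['2008-11-05', '2008-12-15'])

# normalize() sets every timestamp to midnight, so whole days can be compared.
selected = df[df.index.normalize().isin(days)]
print(selected)
```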
