
Selecting Subset of Pandas DataFrame

I have two different pandas DataFrames, and I want to extract data from one DataFrame whenever the other DataFrame has a specific value for the same date. To be concrete, I have one object called "GDP" which looks as follows:

               GDP
DATE               
1947-01-01    243.1
1947-04-01    246.3
1947-07-01    250.1

I additionally have a DataFrame called "recession" which contains data like the following:

            USRECQ
DATE         
1949-07-01       1
1949-10-01       1
1950-01-01       0

I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?

Let's modify the example you posted so the dates overlap:

import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP':np.arange(10)*10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))

#             GDP
# 2000-01-01    0
# 2000-01-02   10
# 2000-01-03   20
# 2000-01-04   30
# 2000-01-05   40
# 2000-01-06   50
# 2000-01-07   60
# 2000-01-08   70
# 2000-01-09   80
# 2000-01-10   90

recession = pd.DataFrame({'USRECQ': [0]*5+[1]*5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
#             USRECQ
# 2000-01-02       0
# 2000-01-03       0
# 2000-01-04       0
# 2000-01-05       0
# 2000-01-06       0
# 2000-01-07       1
# 2000-01-08       1
# 2000-01-09       1
# 2000-01-10       1
# 2000-01-11       1

Then you could join the two dataframes:

combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
#             GDP  USRECQ
# 2000-01-01    0     NaN
# 2000-01-02   10       0
# 2000-01-03   20       0
# 2000-01-04   30       0
# 2000-01-05   40       0
# 2000-01-06   50       0
# 2000-01-07   60       1
# 2000-01-08   70       1
# 2000-01-09   80       1
# 2000-01-10   90       1
# 2000-01-11  NaN       1
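
As a minimal sketch of the how='inner' variant mentioned in the comment above, the following rebuilds the same two example frames and keeps only the dates that appear in both. The variable name `inner` is just an illustrative choice, not something from the original example.

```python
import numpy as np
import pandas as pd

# Same toy frames as in the example above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# An inner join keeps only the dates present in both indexes,
# so no row has a missing GDP or USRECQ value.
inner = GDP.join(recession, how='inner')
print(inner)
```

With the outer join, 2000-01-01 and 2000-01-11 each appear with one missing value; with the inner join those two dates are dropped.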

and select rows based on a condition like this:

In [112]: combined.loc[combined['USRECQ']==0]
Out[112]: 
            GDP  USRECQ
2000-01-02   10       0
2000-01-03   20       0
2000-01-04   30       0
2000-01-05   40       0
2000-01-06   50       0

In [113]: combined.loc[combined['USRECQ']==1]
Out[113]: 
            GDP  USRECQ
2000-01-07   60       1
2000-01-08   70       1
2000-01-09   80       1
2000-01-10   90       1
2000-01-11  NaN       1

To get just the GDP column, supply the column name as the second indexer to combined.loc:

In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]: 
2000-01-07    60
2000-01-08    70
2000-01-09    80
2000-01-10    90
2000-01-11   NaN
Freq: D, Name: GDP, dtype: float64
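
Putting the pieces together, the two series asked for in the question could be built roughly like this. This is a sketch on the toy data, assuming an inner join so that dates missing from either frame are excluded; the names `gdp_expansion` and `gdp_recession` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same toy frames as in the example above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# Align the two frames on their shared dates.
combined = GDP.join(recession, how='inner')

# One GDP series per value of the recession indicator.
gdp_expansion = combined.loc[combined['USRECQ'] == 0, 'GDP']
gdp_recession = combined.loc[combined['USRECQ'] == 1, 'GDP']

print(gdp_expansion)
print(gdp_recession)
```

Each result is a pandas Series indexed by date, so it can be plotted or resampled like any other time series.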

As PaulH points out, you could also use query, which has a nicer syntax:

In [118]: combined.query('USRECQ==1')
Out[118]: 
            GDP  USRECQ
2000-01-07   60       1
2000-01-08   70       1
2000-01-09   80       1
2000-01-10   90       1
2000-01-11  NaN       1
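
query can also refer to a local Python variable by prefixing it with @, which is handy when the value to match is not hard-coded. The sketch below uses a hypothetical variable `flag`, selects the GDP column from the result, and drops the date that has no GDP value.

```python
import numpy as np
import pandas as pd

# Same toy frames as in the example above.
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
combined = GDP.join(recession, how='outer')

flag = 1  # hypothetical variable holding the indicator value to match

# '@flag' refers to the local variable; dropna() removes 2000-01-11,
# which has a recession flag but no GDP value.
gdp_when_flag = combined.query('USRECQ == @flag')['GDP'].dropna()
print(gdp_when_flag)
```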
