
How to efficiently filter a pandas dataframe and return a pandas series?

The question seems simple and arguably on the verge of stupid. But given my scenario, it seems that I have to do exactly that in order to keep a bunch of calculations across several dataframes efficient.

Scenario:

I've got a bunch of pandas dataframes where the column names are constructed from a name part and a time part, such as 'AA_2018' and 'BB_2017' . I'm doing calculations on different columns from different dataframes, so I have to select columns by the name part and disregard the time part. As an MCVE, let's just say that I'd like to subtract the column containing 'AA' from the column containing 'BB' and ignore all other columns in this dataframe:

import pandas as pd
import numpy as np

dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])


If I knew the exact names of the columns, this could easily be done using:

diff_series = df['AA_2018'] - df['BB_2017']

This returns a pandas series since I'm using single brackets [] , as opposed to a dataframe if I had used double brackets [[]] .
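As a quick illustration of the bracket distinction just described, here is a minimal sketch that reuses the sample dataframe from above; the variable names single and double are only illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.random.randn(3, 3), index=dates,
                  columns=['AA_2018', 'AB_2018', 'BB_2017'])

# Single brackets with one label select a Series.
single = df['AA_2018']

# Double brackets pass a list of labels and select a one-column DataFrame.
double = df[['AA_2018']]

print(type(single).__name__)
print(type(double).__name__)
print(double.shape)
```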

My challenge:

diff_series is of type pandas.core.series.Series . But since I've got some filtering to do, I'm using df.filter() that returns a dataframe with one column and not a series:

# in:
colAA = df.filter(like = 'AA')

# out:
# AA_2018
# 2018-01-01  0.801295
# 2018-01-02  0.860808
# 2018-01-03 -0.728886

# in:
# type(colAA)

# out:
# pandas.core.frame.DataFrame

Since colAA is of type pandas.core.frame.DataFrame , the following returns a dataframe too:

# in:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA

# out:
            AA_2018  BB_2017
2018-01-01      NaN      NaN
2018-01-02      NaN      NaN
2018-01-03      NaN      NaN    

And that is not what I'm after. This is:

# in: 
diff_series = df['AA_2018'] - df['BB_2017']

# out:
2018-01-01    0.828895
2018-01-02   -1.153436
2018-01-03   -1.159985
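The all-NaN frame from colBB - colAA is most likely explained by label alignment: arithmetic between two dataframes matches on both the index and the column labels, and the labels 'AA_2018' and 'BB_2017' have nothing in common. Arithmetic between two series aligns on the index only. A minimal sketch of the difference, using the same sample data (the names aligned and diff are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.random.randn(3, 3), index=dates,
                  columns=['AA_2018', 'AB_2018', 'BB_2017'])

colAA = df.filter(like='AA')   # one-column DataFrame labelled 'AA_2018'
colBB = df.filter(like='BB')   # one-column DataFrame labelled 'BB_2017'

# DataFrame - DataFrame aligns on index AND column labels.
# No column label is shared, so every cell of the result is NaN.
aligned = colBB - colAA
print(aligned.isna().all().all())

# Series - Series aligns on the index only, so the values line up by date.
diff = df['BB_2017'] - df['AA_2018']
print(diff.notna().all())
```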

Why am I adamant about doing it this way?

Because I'd like to end up with a dataframe using .to_frame() with a specified name based on the filters I've used.

My presumably inefficient approach is this:

# in:

colAA_values = [item for sublist in colAA.values for item in sublist]
# (because colAA.values returns a 2-D NumPy array, i.e. one nested row per index entry)

colBB_values = [item for sublist in colBB.values for item in sublist]

serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)

df_diff = (serBB - serAA).to_frame(name = 'someFilter')

# out:

              someFilter
2018-01-01   -0.828895
2018-01-02    1.153436
2018-01-03    1.159985
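For comparison, the flattening step above can be written without a Python-level loop. This is only a sketch of the same manual approach, assuming a pandas version that provides DataFrame.to_numpy() (0.24 or later); ravel() turns the (n, 1) array into a flat array of length n:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.random.randn(3, 3), index=dates,
                  columns=['AA_2018', 'AB_2018', 'BB_2017'])

colAA = df.filter(like='AA')
colBB = df.filter(like='BB')

# to_numpy() gives a 2-D array of shape (3, 1); ravel() flattens it to shape (3,).
serAA = pd.Series(colAA.to_numpy().ravel(), index=colAA.index)
serBB = pd.Series(colBB.to_numpy().ravel(), index=colBB.index)

df_diff = (serBB - serAA).to_frame(name='someFilter')
print(df_diff)
```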

What I've tried / What I was hoping to work:

# in:
(df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')

# out:
# AttributeError: 'DataFrame' object has no attribute 'to_frame'

# (Of course because df.filter() returns a one-column dataframe)

I was also hoping that df.filter() could be set to return a pandas series, but no.

I guess I could have asked this question instead: How to convert a pandas dataframe column to a pandas series? But that does not seem to have an efficient built-in one-liner either. Most search results handle the other way around instead. I've been messing around with potential workarounds for quite some time now, and an obvious solution might be right around the corner, but I'm hoping some of you have a suggestion on how to do this efficiently.

All code elements for an easy copy&paste:

import pandas as pd
import numpy as np

dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])

#diff_series = df[['AA_2018']] - df[['BB_2017']]
#type(diff_series)

colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA

#type(df_filtered)
#type(colAA)
#colAA.values

# colAA.values returns a 2-D NumPy array that has to be flattened for use in pd.Series
colAA_values = [item for sublist in colAA.values for item in sublist]
colBB_values = [item for sublist in colBB.values for item in sublist]

serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)

df_diff = (serBB - serAA).to_frame(name = 'someFilter')

# Attempts:
# (df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')

You need the opposite of to_frame : DataFrame.squeeze , which converts a one-column DataFrame to a Series :

colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB.squeeze() - colAA.squeeze()
print(df_filtered)

# out:
# 2018-01-01   -0.479247
# 2018-01-02   -3.801711
# 2018-01-03    1.567574
# Freq: D, dtype: float64
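Building on the squeeze() suggestion, the sketch below gets to the named one-column dataframe the question asks for. Two caveats are worth knowing: squeeze() without arguments returns a scalar for a 1x1 frame, and it returns the dataframe unchanged when the filter matches more than one column. The helper filter_to_series is hypothetical (not part of pandas) and simply fails loudly in the multi-column case:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.random.randn(3, 3), index=dates,
                  columns=['AA_2018', 'AB_2018', 'BB_2017'])


def filter_to_series(frame, like):
    """Hypothetical helper: filter columns by substring and return a Series.

    Raises ValueError unless exactly one column matches.
    """
    matched = frame.filter(like=like)
    if matched.shape[1] != 1:
        raise ValueError(
            f"expected exactly one column containing {like!r}, "
            f"got {list(matched.columns)}"
        )
    # iloc[:, 0] always yields a Series, even for a one-row frame.
    return matched.iloc[:, 0]


df_diff = (filter_to_series(df, 'BB') - filter_to_series(df, 'AA')).to_frame(name='someFilter')
print(df_diff)

# squeeze(axis=1) only collapses the column axis, so a 1x1 frame
# still comes back as a Series rather than a scalar.
one_row = df.iloc[:1].filter(like='AA')
print(type(one_row.squeeze(axis=1)).__name__)
```

With this sample data, filter_to_series(df, 'A') raises ValueError because both 'AA_2018' and 'AB_2018' contain 'A'.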

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 