
How to get the filtered dataframe for calculations while retaining the original one in pandas?

I have some confusion about how pandas uses filtered rows. Say we have this market data dataframe 'df':

Time                    Open    High    Low     Close   Volume
31.12.2003 23:00:00.000 82440   83150   82440   82880   47686.32
01.01.2004 23:00:00.000 82830   83100   82350   83100   37571.04
02.01.2004 23:00:00.000 83100   83100   83100   83100   0.00

Now we filter the rows to get a df with only the days the market is open (Volume > 0):

df=df[df['Volume']>0]

Because of the way we filtered the dataframe, the filtered-out rows still have indexes and values, but they are not used in calculations. For instance, if we do:

df.mean()

The values of the filtered-out rows won't appear in the calculation.
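To make that concrete, here is a minimal sketch using just the Close and Volume columns from the three rows above (the numbers in the comments are worked out from those values):

import pandas as pd

# cut-down version of the market data above: Close and Volume only
df = pd.DataFrame({'Close':  [82880, 83100, 83100],
                   'Volume': [47686.32, 37571.04, 0.00]})

df['Close'].mean()                    # 83026.67 -> uses all three rows
df[df['Volume'] > 0]['Close'].mean()  # 82990.0  -> only the two open days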

The confusing part comes here:

How could we take the average of the last 2 values counting back from row 3, using only the non-filtered values? Meaning, if row 2 was filtered out, it should take the mean of rows 3 and 1.
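One way to do this (a sketch, not necessarily the only approach), starting from the original, unfiltered df: slice by position up to the row of interest, drop the closed days, and average the Close of the last two rows that remain:

first_three = df.iloc[:3]                      # rows 1..3 of the original frame
kept = first_three[first_three['Volume'] > 0]  # drop the days with Volume == 0
kept['Close'].tail(2).mean()                   # mean of the last 2 kept rows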

----------- EDIT -------------- Hey, thanks for the comment, trying to be clearer:

Say we have this example dataframe:

Index    Volume
0        1 
1        0
2        1
3        1

Then we filter it:

df=df[df['Volume']>0]

If we send the dataframe to numpy in order to plot or iterate through it, it will also send the rows that we don't want.

If we iterate over that data, it will also iterate over (and consider) the indexes that we are ruling out.

So, how can we get a copy of the dataframe that excludes the ruled-out rows, to avoid those two problems?
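For reference, one way to get such a copy (a sketch based on the simplified four-row example above, starting from the original, unfiltered df; values, reset_index() and iterrows() are standard pandas methods):

clean = df[df['Volume'] > 0].copy()   # copy containing only the kept rows

clean.values                          # numpy array without the dropped rows
list(clean.index)                     # [0, 2, 3] -> original labels survive...
clean = clean.reset_index(drop=True)  # ...renumber them 0..n-1 if you prefer

for idx, row in clean.iterrows():     # iteration only visits the kept rows
    print(idx, row['Volume'])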

I think you're running into a pretty common problem with boolean indexing. When you're trying to filter a DataFrame with a DataFrame of booleans, you need to specify how to handle cases where things are True for some columns/rows but False for others. Do you want items where things are True everywhere, or anywhere?

It's especially tricky in this case since your DataFrame is 1-d, so you'd expect things to work like a Series, where there's no ambiguity: with a Series a row is either True or False; it can't be True in some columns and False in others.

To resolve the ambiguity with DataFrames, use the any() or all() methods:

In [36]: df
Out[36]: 
       Volume
Index        
0           1
1           0
2           1
3           1

[4 rows x 1 columns]

In [37]: df[(df > 0).all(1)]
Out[37]: 
       Volume
Index        
0           1
2           1
3           1

[3 rows x 1 columns]

The 1 inside all() just means along axis 1 (across the columns).
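As a side note (not from the original answer): since this frame has only one column, the mask (df > 0).all(1) collapses to the same boolean Series you would get by comparing that column directly, so these two lines should select the same rows:

df[(df > 0).all(1)]    # boolean DataFrame reduced across the columns
df[df['Volume'] > 0]   # boolean Series on the single column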


Here's a 2-d example that might help clear things up:

In [39]: df = pd.DataFrame({"A": ['a', 'b', 'c', 'd'], "B": ['e', 'f', 'g', 'h']})

In [40]: df
Out[40]: 
   A  B
0  a  e
1  b  f
2  c  g
3  d  h

[4 rows x 2 columns]

In [41]: bf = pd.DataFrame({"A": [True, True, False, False], "B": [True, False, True, False]})

In [42]: bf
Out[42]: 
       A      B
0   True   True
1   True  False
2  False   True
3  False  False

[4 rows x 2 columns]

First, the "wrong" way, with the ambiguity unresolved. It's unclear what to do with (1, 'B') since it's false in bf , but there is a 1 row and a B column, so a NaN is filled:

In [43]: df[bf]
Out[43]: 
     A    B
0    a    e
1    b  NaN
2  NaN    g
3  NaN  NaN

[4 rows x 2 columns]

all matches only the first row, since that's the only one where both columns are True:

In [44]: df[bf.all(1)]
Out[44]: 
   A  B
0  a  e

[1 rows x 2 columns]

any matches all but the last row, since that one has both values False:

In [45]: df[bf.any(1)]
Out[45]: 
   A  B
0  a  e
1  b  f
2  c  g

[3 rows x 2 columns]
