Isolating Adjacent columns based on str.contains

Hi all, my dataframe looks like this:

 A | B                        | C   | D   | E
   | 'USD'                    |     |     |
   | 'trading expenses-total' |     |     |
   | 8.10                     | 2.3 | 5.5 |
   | 9.1                      | 1.4 | 6.1 |
   | 5.4                      | 5.1 | 7.8 |

I haven't found anything quite like this, so apologies if this is a duplicate. Essentially, I am trying to locate the column that contains the string 'total' (column B) together with the two columns to its right (C and D), and turn them into a new dataframe. I feel like I am close with the following code:

test.loc[:,test.columns.str.contains('total')]

which isolates the correct column, but I can't quite figure out how to grab the two adjacent columns. My desired output is:

 B                        | C   | D
 'USD'                    |     |
 'trading expenses-total' |     |
 8.10                     | 2.3 | 5.5
 9.1                      | 1.4 | 6.1
 5.4                      | 5.1 | 7.8

Here's one approach -

# scipy.ndimage.morphology is a deprecated namespace; import from scipy.ndimage directly
from scipy.ndimage import binary_dilation as bind

mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]

If you don't have access to SciPy, you can also use np.convolve, like so -

import numpy as np

test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
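
To see why the np.convolve version picks up the two columns to the right, here is a minimal walk-through. The frame and its values are made up for illustration, and the names `spread` and `test_out` are just local variables, not part of any API.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only.
test = pd.DataFrame(np.arange(15).reshape(3, 5),
                    columns=['P', 'total001', 'g', 'r', 't'])

# Boolean mask over the column names: [False, True, False, False, False]
mask = np.asarray(test.columns.str.contains('total'), dtype=int)

# The full convolution with [1, 1, 1] has len(mask) + 2 entries, and entry i
# equals mask[i] + mask[i-1] + mask[i-2]. Dropping the last two entries keeps
# the result aligned with the columns, so each match also flags the two
# columns to its right.
spread = np.convolve(mask, [1, 1, 1])[:-2] > 0

test_out = test.iloc[:, spread]
print(list(test_out.columns))
```

Changing the kernel length changes how many right-hand neighbours are kept: a kernel of k ones with the last k-1 entries dropped keeps each match plus k-1 columns after it.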

Sample runs

Case #1 :

In [390]: np.random.seed(1234)

In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))

In [392]: test.columns = ['P','total001','g','r','t']

In [393]: test
Out[393]: 
   P  total001  g  r  t
0  3         6  5  4  8
1  1         7  6  8  0
2  5         0  6  2  0

In [394]: mask = test.columns.str.contains('total')

In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]: 
   total001  g  r
0         6  5  4
1         7  6  8
2         0  6  2

Case #2 :

This also works when several columns match, and when a matching column has fewer than two columns to its right (the selection simply stops at the last column) -

In [401]: np.random.seed(1234)

In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))

In [403]: test.columns = ['P','total001','g','r','t','total002','k']

In [406]: test
Out[406]: 
   P  total001  g  r  t  total002  k
0  3         6  5  4  8         1  7
1  6         8  0  5  0         6  2
2  0         5  2  6  3         7  0

In [407]: mask = test.columns.str.contains('total')

In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]: 
   total001  g  r  total002  k
0         6  5  4         1  7
1         8  0  5         6  2
2         5  2  6         7  0

OLD answer:

Pandas approach:

In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])

In [37]: df
Out[37]:
          A     total         C         D         E
0  0.789482  0.427260  0.169065  0.112993  0.142648
1  0.303391  0.484157  0.454579  0.410785  0.827571
2  0.984273  0.001532  0.676777  0.026324  0.094534

In [38]: idx = np.argmax(df.columns.str.contains('total'))

In [39]: df.iloc[:, idx:idx+3]
Out[39]:
      total         C         D
0  0.427260  0.169065  0.112993
1  0.484157  0.454579  0.410785
2  0.001532  0.676777  0.026324
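
One caveat with the np.argmax approach: on an all-False mask, np.argmax returns 0, so when no column matches, the slice silently returns the first three columns. A possible guard is sketched below; the helper name `first_match_with_neighbours` and its `n_right` parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def first_match_with_neighbours(df, pattern, n_right=2):
    """Hypothetical helper: first column whose name contains `pattern`,
    plus up to `n_right` columns to its right."""
    mask = np.asarray(df.columns.str.contains(pattern), dtype=bool)
    if not mask.any():
        # Without this check, argmax would return 0 and select columns 0..n_right.
        raise KeyError(f"no column name contains {pattern!r}")
    idx = int(np.argmax(mask))  # position of the first True
    return df.iloc[:, idx:idx + n_right + 1]

# Made-up data, same column layout as the example above.
df = pd.DataFrame(np.zeros((3, 5)), columns=['A', 'total', 'C', 'D', 'E'])
out = first_match_with_neighbours(df, 'total')
print(list(out.columns))
```

Unlike the mask-based answers, this only handles the first matching column.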

UPDATE:

In [118]: df
Out[118]:
     A                       B    C    D     E
0  NaN                     USD  NaN  NaN   NaN
1  NaN  trading expenses-total  NaN  NaN   NaN
2    A                    8.10  2.3  5.5  10.0
3    B                     9.1  1.4  6.1  11.0
4    C                     5.4  5.1  7.8  12.0

In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()

In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()

In [121]: col
Out[121]: 'B'

In [122]: cols
Out[122]: ['B', 'C', 'D']

In [123]: df[cols]
Out[123]:
                        B    C    D
0                     USD  NaN  NaN
1  trading expenses-total  NaN  NaN
2                    8.10  2.3  5.5
3                     9.1  1.4  6.1
4                     5.4  5.1  7.8
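
For reference, here is a self-contained variant of the UPDATE snippet that can be run as-is. It is a sketch under a few assumptions: the frame is rebuilt by hand from the output shown above (with column B stored as strings), every cell is cast to str instead of using select_dtypes, and na=False is passed so missing values count as non-matches.

```python
import numpy as np
import pandas as pd

# Frame rebuilt by hand from the printed output; B is assumed to hold strings.
df = pd.DataFrame({
    'A': [np.nan, np.nan, 'A', 'B', 'C'],
    'B': ['USD', 'trading expenses-total', '8.10', '9.1', '5.4'],
    'C': [np.nan, np.nan, 2.3, 1.4, 5.1],
    'D': [np.nan, np.nan, 5.5, 6.1, 7.8],
    'E': [np.nan, np.nan, 10.0, 11.0, 12.0],
})

# For each column, check whether any cell contains 'total'.
has_total = df.astype(str).apply(
    lambda s: s.str.contains('total', na=False).any()
)
if not has_total.any():
    raise ValueError("no cell contains 'total'")

# Position of the first matching column, then that column plus the next two.
start = int(np.argmax(has_total.to_numpy(dtype=bool)))
cols = df.columns[start:start + 3].tolist()
result = df[cols]
print(cols)
```

Slicing by position rather than by label keeps this working even if the column labels are not unique or not sorted.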
