Hi all so my dataframe looks like such:
A | B | C | D | E
'USD'
'trading expenses-total'
8.10 2.3 5.5
9.1 1.4 6.1
5.4 5.1 7.8
I haven't found anything quite like this so apologies if this is a duplicate. But essentially I am trying to locate the column that contains the string 'total' (column B) and their adjacent columns (C and D) and turn them into a dataframe. I feel like I am close with the following code:
test.loc[:,test.columns.str.contains('total')]
which isolates the correct column, but i can't quite figure out how to grab the adjacent two columns. My desired output is:
B | C | D
'USD'
'trading expenses-total'
8.10 2.3 5.5
9.1 1.4 6.1
5.4 5.1 7.8
Here's one approach -
from scipy.ndimage.morphology import binary_dilation as bind
mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]
If you don't have access to SciPy
, you can also use np.convolve
, like so -
test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
Sample runs
Case #1 :
In [390]: np.random.seed(1234)
In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))
In [392]: test.columns = [['P','total001','g','r','t']]
In [393]: test
Out[393]:
P total001 g r t
0 3 6 5 4 8
1 1 7 6 8 0
2 5 0 6 2 0
In [394]: mask = test.columns.str.contains('total')
In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]:
total001 g r
0 6 5 4
1 7 6 8
2 0 6 2
Case #2 :
This also works if you have multiple matching columns and also if you are going out of limits and don't have two columns to the right of the matching columns -
In [401]: np.random.seed(1234)
In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))
In [403]: test.columns = [['P','total001','g','r','t','total002','k']]
In [406]: test
Out[406]:
P total001 g r t total002 k
0 3 6 5 4 8 1 7
1 6 8 0 5 0 6 2
2 0 5 2 6 3 7 0
In [407]: mask = test.columns.str.contains('total')
In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]:
total001 g r total002 k
0 6 5 4 1 7
1 8 0 5 6 2
2 5 2 6 7 0
OLD answer:
Pandas approach:
In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])
In [37]: df
Out[37]:
A total C D E
0 0.789482 0.427260 0.169065 0.112993 0.142648
1 0.303391 0.484157 0.454579 0.410785 0.827571
2 0.984273 0.001532 0.676777 0.026324 0.094534
In [38]: idx = np.argmax(df.columns.str.contains('total'))
In [39]: df.iloc[:, idx:idx+3]
Out[39]:
total C D
0 0.427260 0.169065 0.112993
1 0.484157 0.454579 0.410785
2 0.001532 0.676777 0.026324
UPDATE:
In [118]: df
Out[118]:
A B C D E
0 NaN USD NaN NaN NaN
1 NaN trading expenses-total NaN NaN NaN
2 A 8.10 2.3 5.5 10.0
3 B 9.1 1.4 6.1 11.0
4 C 5.4 5.1 7.8 12.0
In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()
In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()
In [121]: col
Out[121]: 'B'
In [122]: cols
Out[122]: ['B', 'C', 'D']
In [123]: df[cols]
Out[123]:
B C D
0 USD NaN NaN
1 trading expenses-total NaN NaN
2 8.10 2.3 5.5
3 9.1 1.4 6.1
4 5.4 5.1 7.8
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.