Filtering columns based on row values in Pandas

Question

I am trying to split this "master" dataframe into separate dataframes, one for each unique entry in row 2.

    DATE    PROP1   PROP1   PROP1   PROP1   PROP1   PROP1   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2
1   DAYS    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN
2       UNIT1   UNIT2   UNIT3   UNIT4   UNIT5   UNIT6   UNIT7   UNIT8   UNIT3   UNIT4   UNIT11  UNIT12  UNIT1   UNIT2
3                                                           
4   1/1/2020    677 92  342 432 878 831 293 88  69  621 586 576 972 733
5   2/1/2020    515 11  86  754 219 818 822 280 441 11  123 36  430 272
6   3/1/2020    253 295 644 401 574 184 354 12  680 729 823 822 174 602
7   4/1/2020    872 568 505 652 366 982 159 131 218 961 52  85  679 923
8   5/1/2020    93  58  864 682 346 19  293 19  206 500 793 962 630 413
9   6/1/2020    696 262 833 418 876 695 900 781 179 138 143 526 9   866
10  7/1/2020    810 58  579 244 81  858 362 440 186 425 55  920 345 596
11  8/1/2020    834 609 618 214 547 834 301 875 783 216 834 609 550 274
12  9/1/2020    687 935 976 380 885 246 339 904 627 460 659 352 361 793
13  10/1/2020   596 300 810 248 475 718 350 574 825 804 245 209 212 925
14  11/1/2020   584 984 711 879 916 107 277 412 122 683 151 811 129 4
15  12/1/2020   616 515 101 743 650 526 475 991 796 227 880 692 734 799
16  1/1/2021    106 441 305 964 452 249 282 486 374 620 652 793 115 697
17  2/1/2021    969 504 936 678 67  42  985 791 709 689 520 503 102 731
18  3/1/2021    823 169 412 177 783 601 613 251 533 463 13  127 516 15
19  4/1/2021    348 588 140 966 143 576 419 611 128 830 68  209 952 935
20  5/1/2021    96  711 651 121 708 360 159 229 552 951 79  665 709 165
21  6/1/2021    805 657 729 629 249 547 581 583 236 828 636 248 412 535
22  7/1/2021    286 320 908 765 336 286 148 168 821 567 63  908 248 320
23  8/1/2021    707 975 565 699 47  712 700 439 497 106 288 105 872 158
24  9/1/2021    346 523 142 181 904 266 28  740 125 64  287 707 553 437
25  10/1/2021   245 42  773 591 492 512 846 487 983 180 372 306 785 691
26  11/1/2021   785 577 448 489 425 205 672 358 868 637 104 422 873 919

The output should look something like this:

df_unit1

    DATE    PROP1   PROP2
1   DAYS    MEAN    MEAN
2       UNIT1   UNIT1
3           
4   1/1/2020    677 972
5   2/1/2020    515 430
6   3/1/2020    253 174
7   4/1/2020    872 679
8   5/1/2020    93  630
9   6/1/2020    696 9
10  7/1/2020    810 345
11  8/1/2020    834 550
12  9/1/2020    687 361
13  10/1/2020   596 212
14  11/1/2020   584 129
15  12/1/2020   616 734
16  1/1/2021    106 115
17  2/1/2021    969 102
18  3/1/2021    823 516
19  4/1/2021    348 952
20  5/1/2021    96  709
21  6/1/2021    805 412
22  7/1/2021    286 248
23  8/1/2021    707 872
24  9/1/2021    346 553
25  10/1/2021   245 785
26  11/1/2021   785 873

df_unit2

    DATE    PROP1   PROP2
1   DAYS    MEAN    MEAN
2       UNIT2   UNIT2
3           
4   1/1/2020    92  733
5   2/1/2020    11  272
6   3/1/2020    295 602
7   4/1/2020    568 923
8   5/1/2020    58  413
9   6/1/2020    262 866
10  7/1/2020    58  596
11  8/1/2020    609 274
12  9/1/2020    935 793
13  10/1/2020   300 925
14  11/1/2020   984 4
15  12/1/2020   515 799
16  1/1/2021    441 697
17  2/1/2021    504 731
18  3/1/2021    169 15
19  4/1/2021    588 935
20  5/1/2021    711 165
21  6/1/2021    657 535
22  7/1/2021    320 320
23  8/1/2021    975 158
24  9/1/2021    523 437
25  10/1/2021   42  691
26  11/1/2021   577 919

I have extracted the unique units from that row:

unitName = pd.Series(pd.Series(df[2,:]).unique(), name = "Unit Names")
unitName = unitName.tolist()

Next, I was planning to loop through this list of unique units and create one dataframe per unit:

for unit in unitName:
   df_unit = df.iloc[[df.iloc[2:,:].str.match(unit)],:]
   print(df_unit)

I am getting the error 'DataFrame' object has no attribute 'str'. My plan was to find all the cells in row 2 that match a given unit and then extract the entire column for each matching cell.

Answer 1

This response has two parts:

Solution 1: Strip columns based on common name in dataframe

With the assumption that your dataframe columns look as follows:

['DATE DAYS', 'PROP1 MEAN UNIT1', 'PROP1 MEAN UNIT2', 'PROP1 MEAN UNIT3', 'PROP1 MEAN UNIT4', 'PROP1 MEAN UNIT5', 'PROP1 MEAN UNIT6', 'PROP2 MEAN UNIT7', 'PROP2 MEAN UNIT8', 'PROP2 MEAN UNIT3', 'PROP2 MEAN UNIT4', 'PROP2 MEAN UNIT11', 'PROP2 MEAN UNIT12', 'PROP2 MEAN UNIT1', 'PROP2 MEAN UNIT2']

and the first few records of your dataframe look like this:

    DATE DAYS PROP1 MEAN UNIT1  ... PROP2 MEAN UNIT1 PROP2 MEAN UNIT2
0    1/1/2020              677  ...              972              733
1    2/1/2020              515  ...              430              272
2    3/1/2020              253  ...              174              602
3    4/1/2020              872  ...              679              923
4    5/1/2020               93  ...              630              413
5    6/1/2020              696  ...                9              866
6    7/1/2020              810  ...              345              596

The following lines of code should give you what you want:

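cols = df.columns.tolist()

# Unique unit names, taken from the last word of every column except the date column
units = sorted(set(x.split()[-1] for x in cols[1:]))

# Data columns ordered by unit name
s_units = sorted(cols[1:], key=lambda x: x.split()[-1])

for i in units:
    # Keep the date column plus every column whose unit matches exactly
    unit_sublist = ['DATE DAYS'] + [j for j in s_units if j.split()[-1] == i]
    print('df_' + i.lower())
    print(df[unit_sublist])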

I got the following:

df_unit1
    DATE DAYS PROP1 MEAN UNIT1 PROP2 MEAN UNIT1
0    1/1/2020              677              972
1    2/1/2020              515              430
2    3/1/2020              253              174
3    4/1/2020              872              679
4    5/1/2020               93              630
5    6/1/2020              696                9
6    7/1/2020              810              345

df_unit11
    DATE DAYS PROP2 MEAN UNIT11
0    1/1/2020               586
1    2/1/2020               123
2    3/1/2020               823
3    4/1/2020                52
4    5/1/2020               793
5    6/1/2020               143
6    7/1/2020                55

df_unit12
    DATE DAYS PROP2 MEAN UNIT12
0    1/1/2020               576
1    2/1/2020                36
2    3/1/2020               822
3    4/1/2020                85
4    5/1/2020               962
5    6/1/2020               526
6    7/1/2020               920

df_unit2
    DATE DAYS PROP1 MEAN UNIT2 PROP2 MEAN UNIT2
0    1/1/2020               92              733
1    2/1/2020               11              272
2    3/1/2020              295              602
3    4/1/2020              568              923
4    5/1/2020               58              413
5    6/1/2020              262              866
6    7/1/2020               58              596

df_unit3
    DATE DAYS PROP1 MEAN UNIT3 PROP2 MEAN UNIT3
0    1/1/2020              342               69
1    2/1/2020               86              441
2    3/1/2020              644              680
3    4/1/2020              505              218
4    5/1/2020              864              206
5    6/1/2020              833              179
6    7/1/2020              579              186

df_unit4
    DATE DAYS PROP1 MEAN UNIT4 PROP2 MEAN UNIT4
0    1/1/2020              432              621
1    2/1/2020              754               11
2    3/1/2020              401              729
3    4/1/2020              652              961
4    5/1/2020              682              500
5    6/1/2020              418              138
6    7/1/2020              244              425

df_unit5
    DATE DAYS PROP1 MEAN UNIT5
0    1/1/2020              878
1    2/1/2020              219
2    3/1/2020              574
3    4/1/2020              366
4    5/1/2020              346
5    6/1/2020              876
6    7/1/2020               81

df_unit6
    DATE DAYS PROP1 MEAN UNIT6
0    1/1/2020              831
1    2/1/2020              818
2    3/1/2020              184
3    4/1/2020              982
4    5/1/2020               19
5    6/1/2020              695
6    7/1/2020              858

df_unit7
    DATE DAYS PROP2 MEAN UNIT7
0    1/1/2020              293
1    2/1/2020              822
2    3/1/2020              354
3    4/1/2020              159
4    5/1/2020              293
5    6/1/2020              900
6    7/1/2020              362

df_unit8
    DATE DAYS PROP2 MEAN UNIT8
0    1/1/2020               88
1    2/1/2020              280
2    3/1/2020               12
3    4/1/2020              131
4    5/1/2020               19
5    6/1/2020              781
6    7/1/2020              440
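If you want to keep the per-unit dataframes instead of only printing them, one option is to collect them in a dictionary keyed by unit name. This is a minimal sketch, assuming the joined column names shown above; the small sample frame and the names cols_by_unit and frames are my own and only there for illustration.

```python
import pandas as pd

# Small stand-in for the frame above (two rows, a few columns only)
df = pd.DataFrame({
    'DATE DAYS': ['1/1/2020', '2/1/2020'],
    'PROP1 MEAN UNIT1': [677, 515],
    'PROP1 MEAN UNIT2': [92, 11],
    'PROP2 MEAN UNIT11': [586, 123],
    'PROP2 MEAN UNIT1': [972, 430],
    'PROP2 MEAN UNIT2': [733, 272],
})

date_col = 'DATE DAYS'
value_cols = [c for c in df.columns if c != date_col]

# Group the column names by their last word, which is the unit
cols_by_unit = {}
for col in value_cols:
    cols_by_unit.setdefault(col.split()[-1], []).append(col)

# One dataframe per unit, each keeping the date column
frames = {unit: df[[date_col] + cols] for unit, cols in cols_by_unit.items()}

print(frames['UNIT1'])
```

Comparing the whole last word means UNIT1 does not pick up UNIT11 or UNIT12, and frames['UNIT2'] and so on can be looked up later without creating one variable per unit.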

Solution 2: Create column names based on first 3 rows in the source data

Let us assume the first 6 rows of your dataframe look like this:

DATE    PROP1   PROP1   PROP1   PROP1   PROP1   PROP1   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2
DAYS    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN
        UNIT1   UNIT2   UNIT3   UNIT4   UNIT5   UNIT6   UNIT7   UNIT8   UNIT3   UNIT4   UNIT11  UNIT12  UNIT1   UNIT2
4   1/1/2020    677 92  342 432 878 831 293 88  69  621 586 576 972 733
5   2/1/2020    515 11  86  754 219 818 822 280 441 11  123 36  430 272
6   3/1/2020    253 295 644 401 574 184 354 12  680 729 823 822 174 602

Then you can use the code below to create the dataframe.

data = '''DATE    PROP1   PROP1   PROP1   PROP1   PROP1   PROP1   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2   PROP2
DAYS    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN    MEAN
        UNIT1   UNIT2   UNIT3   UNIT4   UNIT5   UNIT6   UNIT7   UNIT8   UNIT3   UNIT4   UNIT11  UNIT12  UNIT1   UNIT2
4   1/1/2020    677 92  342 432 878 831 293 88  69  621 586 576 972 733
5   2/1/2020    515 11  86  754 219 818 822 280 441 11  123 36  430 272
6   3/1/2020    253 295 644 401 574 184 354 12  680 729 823 822 174 602
7   4/1/2020    872 568 505 652 366 982 159 131 218 961 52  85  679 923
8   5/1/2020    93  58  864 682 346 19  293 19  206 500 793 962 630 413
9   6/1/2020    696 262 833 418 876 695 900 781 179 138 143 526 9   866
10  7/1/2020    810 58  579 244 81  858 362 440 186 425 55  920 345 596
11  8/1/2020    834 609 618 214 547 834 301 875 783 216 834 609 550 274
12  9/1/2020    687 935 976 380 885 246 339 904 627 460 659 352 361 793
13  10/1/2020   596 300 810 248 475 718 350 574 825 804 245 209 212 925
14  11/1/2020   584 984 711 879 916 107 277 412 122 683 151 811 129 4
15  12/1/2020   616 515 101 743 650 526 475 991 796 227 880 692 734 799
16  1/1/2021    106 441 305 964 452 249 282 486 374 620 652 793 115 697
17  2/1/2021    969 504 936 678 67  42  985 791 709 689 520 503 102 731
18  3/1/2021    823 169 412 177 783 601 613 251 533 463 13  127 516 15
19  4/1/2021    348 588 140 966 143 576 419 611 128 830 68  209 952 935
20  5/1/2021    96  711 651 121 708 360 159 229 552 951 79  665 709 165
21  6/1/2021    805 657 729 629 249 547 581 583 236 828 636 248 412 535
22  7/1/2021    286 320 908 765 336 286 148 168 821 567 63  908 248 320
23  8/1/2021    707 975 565 699 47  712 700 439 497 106 288 105 872 158
24  9/1/2021    346 523 142 181 904 266 28  740 125 64  287 707 553 437
25  10/1/2021   245 42  773 591 492 512 846 487 983 180 372 306 785 691
26  11/1/2021   785 577 448 489 425 205 672 358 868 637 104 422 873 919'''

import pandas as pd

data_list = data.split('\n')

# The first three lines hold the header levels; the third has no entry above the dates
data_line1 = data_list[0].split()
data_line2 = data_list[1].split()
data_line3 = [''] + data_list[2].split()

# Join the three levels into one name per column, e.g. 'PROP1 MEAN UNIT1'
data_header = [' '.join(parts).strip() for parts in zip(data_line1, data_line2, data_line3)]

# Drop the leading row number from every data line (all rows, including the last one)
rows = [line.split()[1:] for line in data_list[3:]]

df = pd.DataFrame(rows, columns=data_header)

print (df)
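As an alternative to joining the three header rows into one string, they could be kept as separate levels of a column MultiIndex, so that a unit is selected by comparing one level. This is only a sketch on a cut-down sample; the level names prop, stat and unit are hypothetical labels I picked, not something from your data.

```python
import pandas as pd

# Three header rows, one list per level (only a few columns of the sample)
header = [
    ['DATE', 'PROP1', 'PROP1', 'PROP2', 'PROP2'],
    ['DAYS', 'MEAN', 'MEAN', 'MEAN', 'MEAN'],
    ['', 'UNIT1', 'UNIT2', 'UNIT1', 'UNIT2'],
]
rows = [
    ['1/1/2020', 677, 92, 972, 733],
    ['2/1/2020', 515, 11, 430, 272],
]

# Level names are illustrative assumptions
columns = pd.MultiIndex.from_arrays(header, names=['prop', 'stat', 'unit'])
df = pd.DataFrame(rows, columns=columns)

# Boolean masks over the columns: the date column, and the columns of one unit
is_date = df.columns.get_level_values('prop') == 'DATE'
units = df.columns.get_level_values('unit')

df_unit1 = df.loc[:, is_date | (units == 'UNIT1')]
print(df_unit1)
```

The comparison on the unit level is an exact match, so no string slicing is needed to tell UNIT1 from UNIT11.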

Answer 2

Here is what worked for me.

#Assign unique column names to the dataframe
df.columns = range(df.shape[1])

#Get all the unique units in row index 2, skipping the empty cell above the dates
unitName = pd.Series(df.loc[2,:].dropna().unique(), name = "Unit Names")

#Convert them to a list to loop through
unitName = unitName.tolist()

for var in unitName:
#this looks for an exact match for the unit in row index 2 and 
#extracts the entire column with the match
    df_item = df[df.columns[df.loc[2].str.fullmatch(var, na=False)]]
    print (df_item)
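For context on the original error: .str exists on a Series, not on a DataFrame, so the unit row has to be selected as a single row before matching. Below is a small sketch of the same idea that also keeps the date column in every result. It assumes the header rows sit in the frame as ordinary data rows with the units at position 2 and the dates in the first column; the names raw, unit_row and frames are my own.

```python
import pandas as pd

# Raw frame with the three header rows stored as ordinary rows (positions 0 to 2)
raw = pd.DataFrame([
    ['DATE', 'PROP1', 'PROP1', 'PROP2', 'PROP2'],
    ['DAYS', 'MEAN', 'MEAN', 'MEAN', 'MEAN'],
    [None, 'UNIT1', 'UNIT2', 'UNIT1', 'UNIT2'],
    ['1/1/2020', 677, 92, 972, 733],
    ['2/1/2020', 515, 11, 430, 272],
])

# A single row is a Series, so element-wise comparison (and .str) is available
unit_row = raw.iloc[2]
units = unit_row.dropna().unique()

frames = {}
for unit in units:
    # Column 0 holds the dates; add every column whose unit cell matches exactly
    keep = [0] + [c for c in raw.columns[1:] if unit_row[c] == unit]
    frames[unit] = raw[keep]

print(frames['UNIT1'])
```

A plain == comparison is enough here because the match is exact; str.fullmatch is mainly useful when the unit is given as a regular expression.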
