I am trying to create separate dataframes from this "master" dataframe, one for each unique entry in row 2.
DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
1 DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
2 UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
3
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
7 4/1/2020 872 568 505 652 366 982 159 131 218 961 52 85 679 923
8 5/1/2020 93 58 864 682 346 19 293 19 206 500 793 962 630 413
9 6/1/2020 696 262 833 418 876 695 900 781 179 138 143 526 9 866
10 7/1/2020 810 58 579 244 81 858 362 440 186 425 55 920 345 596
11 8/1/2020 834 609 618 214 547 834 301 875 783 216 834 609 550 274
12 9/1/2020 687 935 976 380 885 246 339 904 627 460 659 352 361 793
13 10/1/2020 596 300 810 248 475 718 350 574 825 804 245 209 212 925
14 11/1/2020 584 984 711 879 916 107 277 412 122 683 151 811 129 4
15 12/1/2020 616 515 101 743 650 526 475 991 796 227 880 692 734 799
16 1/1/2021 106 441 305 964 452 249 282 486 374 620 652 793 115 697
17 2/1/2021 969 504 936 678 67 42 985 791 709 689 520 503 102 731
18 3/1/2021 823 169 412 177 783 601 613 251 533 463 13 127 516 15
19 4/1/2021 348 588 140 966 143 576 419 611 128 830 68 209 952 935
20 5/1/2021 96 711 651 121 708 360 159 229 552 951 79 665 709 165
21 6/1/2021 805 657 729 629 249 547 581 583 236 828 636 248 412 535
22 7/1/2021 286 320 908 765 336 286 148 168 821 567 63 908 248 320
23 8/1/2021 707 975 565 699 47 712 700 439 497 106 288 105 872 158
24 9/1/2021 346 523 142 181 904 266 28 740 125 64 287 707 553 437
25 10/1/2021 245 42 773 591 492 512 846 487 983 180 372 306 785 691
26 11/1/2021 785 577 448 489 425 205 672 358 868 637 104 422 873 919
The output should look something like this:
df_unit1
DATE PROP1 PROP2
1 DAYS MEAN MEAN
2 UNIT1 UNIT1
3
4 1/1/2020 677 972
5 2/1/2020 515 430
6 3/1/2020 253 174
7 4/1/2020 872 679
8 5/1/2020 93 630
9 6/1/2020 696 9
10 7/1/2020 810 345
11 8/1/2020 834 550
12 9/1/2020 687 361
13 10/1/2020 596 212
14 11/1/2020 584 129
15 12/1/2020 616 734
16 1/1/2021 106 115
17 2/1/2021 969 102
18 3/1/2021 823 516
19 4/1/2021 348 952
20 5/1/2021 96 709
21 6/1/2021 805 412
22 7/1/2021 286 248
23 8/1/2021 707 872
24 9/1/2021 346 553
25 10/1/2021 245 785
26 11/1/2021 785 873
df_unit2
DATE PROP1 PROP2
1 DAYS MEAN MEAN
2 UNIT2 UNIT2
3
4 1/1/2020 92 733
5 2/1/2020 11 272
6 3/1/2020 295 602
7 4/1/2020 568 923
8 5/1/2020 58 413
9 6/1/2020 262 866
10 7/1/2020 58 596
11 8/1/2020 609 274
12 9/1/2020 935 793
13 10/1/2020 300 925
14 11/1/2020 984 4
15 12/1/2020 515 799
16 1/1/2021 441 697
17 2/1/2021 504 731
18 3/1/2021 169 15
19 4/1/2021 588 935
20 5/1/2021 711 165
21 6/1/2021 657 535
22 7/1/2021 320 320
23 8/1/2021 975 158
24 9/1/2021 523 437
25 10/1/2021 42 691
26 11/1/2021 577 919
I have extracted the unique units from the row
unitName = pd.Series(pd.Series(df[2,:]).unique(), name = "Unit Names")
unitName = unitName.tolist()
Next, I was planning to loop through this list of unique units and create a dataframe for each unit:
for unit in unitName:
    df_unit = df.iloc[[df.iloc[2:,:].str.match(unit)],:]
    print(df_unit)
I am getting the error 'DataFrame' object has no attribute 'str'. My plan was to find every cell in row 2 that matches a given unit and then extract the entire column for each matching cell.
This response has two parts:
With the assumption that your dataframe columns look as follows:
['DATE DAYS', 'PROP1 MEAN UNIT1', 'PROP1 MEAN UNIT2', 'PROP1 MEAN UNIT3', 'PROP1 MEAN UNIT4', 'PROP1 MEAN UNIT5', 'PROP1 MEAN UNIT6', 'PROP2 MEAN UNIT7', 'PROP2 MEAN UNIT8', 'PROP2 MEAN UNIT3', 'PROP2 MEAN UNIT4', 'PROP2 MEAN UNIT11', 'PROP2 MEAN UNIT12', 'PROP2 MEAN UNIT1', 'PROP2 MEAN UNIT2']
and the first few records of your dataframe look like this:
DATE DAYS PROP1 MEAN UNIT1 ... PROP2 MEAN UNIT1 PROP2 MEAN UNIT2
0 1/1/2020 677 ... 972 733
1 2/1/2020 515 ... 430 272
2 3/1/2020 253 ... 174 602
3 4/1/2020 872 ... 679 923
4 5/1/2020 93 ... 630 413
5 6/1/2020 696 ... 9 866
6 7/1/2020 810 ... 345 596
The following lines of code should give you what you want:
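cols = df.columns.tolist()
# Unique unit names, taken from the trailing 'UNIT...' part of each column name
units = sorted(set(x[x.rfind('UNIT'):] for x in cols[1:]))
# Data columns ordered by unit name
s_units = sorted(cols[1:], key=lambda x: x.split()[2])
for i in units:
    # Date column plus every column whose last token is this unit
    unit_sublist = ['DATE DAYS'] + [j for j in s_units if j.split()[-1] == i]
    print('df_' + i.lower())
    print(df[unit_sublist])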
I got the following:
df_unit1
DATE DAYS PROP1 MEAN UNIT1 PROP2 MEAN UNIT1
0 1/1/2020 677 972
1 2/1/2020 515 430
2 3/1/2020 253 174
3 4/1/2020 872 679
4 5/1/2020 93 630
5 6/1/2020 696 9
6 7/1/2020 810 345
df_unit11
DATE DAYS PROP2 MEAN UNIT11
0 1/1/2020 586
1 2/1/2020 123
2 3/1/2020 823
3 4/1/2020 52
4 5/1/2020 793
5 6/1/2020 143
6 7/1/2020 55
df_unit12
DATE DAYS PROP2 MEAN UNIT12
0 1/1/2020 576
1 2/1/2020 36
2 3/1/2020 822
3 4/1/2020 85
4 5/1/2020 962
5 6/1/2020 526
6 7/1/2020 920
df_unit2
DATE DAYS PROP1 MEAN UNIT2 PROP2 MEAN UNIT2
0 1/1/2020 92 733
1 2/1/2020 11 272
2 3/1/2020 295 602
3 4/1/2020 568 923
4 5/1/2020 58 413
5 6/1/2020 262 866
6 7/1/2020 58 596
df_unit3
DATE DAYS PROP1 MEAN UNIT3 PROP2 MEAN UNIT3
0 1/1/2020 342 69
1 2/1/2020 86 441
2 3/1/2020 644 680
3 4/1/2020 505 218
4 5/1/2020 864 206
5 6/1/2020 833 179
6 7/1/2020 579 186
df_unit4
DATE DAYS PROP1 MEAN UNIT4 PROP2 MEAN UNIT4
0 1/1/2020 432 621
1 2/1/2020 754 11
2 3/1/2020 401 729
3 4/1/2020 652 961
4 5/1/2020 682 500
5 6/1/2020 418 138
6 7/1/2020 244 425
df_unit5
DATE DAYS PROP1 MEAN UNIT5
0 1/1/2020 878
1 2/1/2020 219
2 3/1/2020 574
3 4/1/2020 366
4 5/1/2020 346
5 6/1/2020 876
6 7/1/2020 81
df_unit6
DATE DAYS PROP1 MEAN UNIT6
0 1/1/2020 831
1 2/1/2020 818
2 3/1/2020 184
3 4/1/2020 982
4 5/1/2020 19
5 6/1/2020 695
6 7/1/2020 858
df_unit7
DATE DAYS PROP2 MEAN UNIT7
0 1/1/2020 293
1 2/1/2020 822
2 3/1/2020 354
3 4/1/2020 159
4 5/1/2020 293
5 6/1/2020 900
6 7/1/2020 362
df_unit8
DATE DAYS PROP2 MEAN UNIT8
0 1/1/2020 88
1 2/1/2020 280
2 3/1/2020 12
3 4/1/2020 131
4 5/1/2020 19
5 6/1/2020 781
6 7/1/2020 440
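If you want to keep the per-unit dataframes instead of only printing them, one option is to collect them in a dictionary keyed by unit name. This is only a sketch under the same assumption about the column names ('PROP MEAN UNIT'); the small `df` below and the names `unit_cols` and `frames` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the frame above; the column names follow the
# 'PROP MEAN UNIT' pattern assumed in this answer.
df = pd.DataFrame({
    'DATE DAYS': ['1/1/2020', '2/1/2020'],
    'PROP1 MEAN UNIT1': [677, 515],
    'PROP1 MEAN UNIT2': [92, 11],
    'PROP2 MEAN UNIT11': [586, 123],
    'PROP2 MEAN UNIT1': [972, 430],
})

date_col = 'DATE DAYS'
unit_cols = {}  # maps a unit name to the list of columns that belong to it
for col in df.columns:
    if col == date_col:
        continue
    unit = col.split()[-1]  # last whitespace-separated token, e.g. 'UNIT11'
    unit_cols.setdefault(unit, [date_col]).append(col)

# One dataframe per unit, each starting with the date column
frames = {unit: df[cols] for unit, cols in unit_cols.items()}

print(frames['UNIT1'])
```

Taking the last token with `split()` does not depend on how many digits the unit number has, so 'UNIT1' and 'UNIT11' end up in different dataframes.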
Let us assume the first 6 rows of your data look like this:
DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
Then you can use the code below to create the dataframe:
data = '''DATE PROP1 PROP1 PROP1 PROP1 PROP1 PROP1 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2 PROP2
DAYS MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN MEAN
UNIT1 UNIT2 UNIT3 UNIT4 UNIT5 UNIT6 UNIT7 UNIT8 UNIT3 UNIT4 UNIT11 UNIT12 UNIT1 UNIT2
4 1/1/2020 677 92 342 432 878 831 293 88 69 621 586 576 972 733
5 2/1/2020 515 11 86 754 219 818 822 280 441 11 123 36 430 272
6 3/1/2020 253 295 644 401 574 184 354 12 680 729 823 822 174 602
7 4/1/2020 872 568 505 652 366 982 159 131 218 961 52 85 679 923
8 5/1/2020 93 58 864 682 346 19 293 19 206 500 793 962 630 413
9 6/1/2020 696 262 833 418 876 695 900 781 179 138 143 526 9 866
10 7/1/2020 810 58 579 244 81 858 362 440 186 425 55 920 345 596
11 8/1/2020 834 609 618 214 547 834 301 875 783 216 834 609 550 274
12 9/1/2020 687 935 976 380 885 246 339 904 627 460 659 352 361 793
13 10/1/2020 596 300 810 248 475 718 350 574 825 804 245 209 212 925
14 11/1/2020 584 984 711 879 916 107 277 412 122 683 151 811 129 4
15 12/1/2020 616 515 101 743 650 526 475 991 796 227 880 692 734 799
16 1/1/2021 106 441 305 964 452 249 282 486 374 620 652 793 115 697
17 2/1/2021 969 504 936 678 67 42 985 791 709 689 520 503 102 731
18 3/1/2021 823 169 412 177 783 601 613 251 533 463 13 127 516 15
19 4/1/2021 348 588 140 966 143 576 419 611 128 830 68 209 952 935
20 5/1/2021 96 711 651 121 708 360 159 229 552 951 79 665 709 165
21 6/1/2021 805 657 729 629 249 547 581 583 236 828 636 248 412 535
22 7/1/2021 286 320 908 765 336 286 148 168 821 567 63 908 248 320
23 8/1/2021 707 975 565 699 47 712 700 439 497 106 288 105 872 158
24 9/1/2021 346 523 142 181 904 266 28 740 125 64 287 707 553 437
25 10/1/2021 245 42 773 591 492 512 846 487 983 180 372 306 785 691
26 11/1/2021 785 577 448 489 425 205 672 358 868 637 104 422 873 919'''
import pandas as pd

data_list = data.split('\n')
data_line1 = data_list[0].split()
data_line2 = data_list[1].split()
# The unit row has no entry above the DATE column, so pad it with an empty string
data_line3 = [''] + data_list[2].split()
# Join the three header rows into one column name, e.g. 'PROP1 MEAN UNIT1'
data_header = [' '.join([data_line1[i], data_line2[i], data_line3[i]]).strip() for i in range(len(data_line1))]
new_data = data_list[3:]
# Drop the leading row number from each record and keep every row, including the last
rows = [line.split()[1:] for line in new_data]
df = pd.DataFrame(rows, columns=data_header)
print(df)
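Instead of joining the three header rows into one string, the same data could also be held with a column MultiIndex, which lets pandas select a unit directly. The following is a rough sketch, not a drop-in replacement: the shortened header lists, the two data rows, and the level names 'prop', 'stat' and 'unit' are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical header rows and data rows, shortened from the question
header1 = ['PROP1', 'PROP1', 'PROP2', 'PROP2']
header2 = ['MEAN', 'MEAN', 'MEAN', 'MEAN']
header3 = ['UNIT1', 'UNIT2', 'UNIT1', 'UNIT2']
dates = ['1/1/2020', '2/1/2020']
values = [
    [677, 92, 972, 733],
    [515, 11, 430, 272],
]

# Three-level column index; the dates become the row index so they stay
# attached to every per-unit dataframe
columns = pd.MultiIndex.from_arrays(
    [header1, header2, header3], names=['prop', 'stat', 'unit']
)
df = pd.DataFrame(values, index=pd.Index(dates, name='DATE'), columns=columns)

# Select every column whose 'unit' level equals the given unit
units = df.columns.get_level_values('unit').unique()
frames = {
    unit: df.xs(unit, axis=1, level='unit', drop_level=False)
    for unit in units
}
print(frames['UNIT1'])
```

With `drop_level=False` the unit name stays in the column header of each result, which is close to the layout asked for in the question.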
Here is what worked for me.
# Assign unique integer column names to the dataframe
df.columns = range(df.shape[1])
# Row index 2 holds the unit names
unit_row = df.loc[2, :]
# Get all the unique units as a list to loop through
unitName = unit_row.dropna().unique().tolist()
for var in unitName:
    # Look for an exact match for the unit in row index 2 and
    # extract every column with a match
    df_item = df.loc[:, unit_row.str.fullmatch(var, na=False)]
    print(df_item)
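For anyone hitting the same error: `.str` exists on a Series (for example a single row), not on a DataFrame, which is why the original `df.iloc[2:,:].str.match(unit)` failed. Here is a small sketch on a made-up frame (the layout and values are assumptions, not the real data) that also shows why an exact match is needed.

```python
import pandas as pd

# Hypothetical frame read without a header row, so the labels sit in the data:
# row 0 holds property names, row 1 holds unit names, row 2 holds values.
df = pd.DataFrame([
    ['DATE', 'PROP1', 'PROP1', 'PROP2'],
    ['', 'UNIT1', 'UNIT11', 'UNIT1'],
    ['1/1/2020', 677, 586, 972],
])

has_str = hasattr(df, 'str')   # a DataFrame has no .str accessor
unit_row = df.loc[1]           # a single row is a Series, which does have .str

# match() only anchors at the start, so 'UNIT1' also hits 'UNIT11';
# fullmatch() requires the whole cell to match
loose = unit_row.str.match('UNIT1', na=False)
exact = unit_row.str.fullmatch('UNIT1', na=False)

df_unit1 = df.loc[:, exact]    # keep only the columns where the mask is True
print(has_str)
print(df_unit1)
```

Passing `na=False` makes empty (NaN) cells count as "no match", so the mask stays purely boolean and can be used for column selection.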