
Python Pandas: Reduce DataFrame to Unique Combinations

I have a data set that lists some basketball player names along with their positions. From that data I have created a DataFrame that lists all possible lineup combinations. That all works fine. My issue is that, since some players are eligible at multiple positions, the DataFrame includes records that contain the same set of players, just listed at different positions. Here's a small example from the DataFrame:

PG | SG | SF | PF | C | G | F | UTIL
Luka Doncic | Tim Hardaway Jr. | Dillon Brooks | Keldon Johnson | Xavier Tillman Sr. | Tyus Jones | DeMar DeRozan | Bradley Beal
Tyus Jones | Dillon Brooks | Tim Hardaway Jr. | DeMar DeRozan | Xavier Tillman Sr. | Luka Doncic | Keldon Johnson | Bradley Beal
Tyus Jones | Bradley Beal | Keldon Johnson | DeMar DeRozan | Xavier Tillman Sr. | Tim Hardaway Jr. | Brandon Clarke | Luka Doncic
Tyus Jones | Tim Hardaway Jr. | Keldon Johnson | DeMar DeRozan | Brandon Clarke | Bradley Beal | Xavier Tillman Sr. | Luka Doncic
Luka Doncic | Tim Hardaway Jr. | Kyle Anderson | Keldon Johnson | Jonas Valanciunas | Tyus Jones | Xavier Tillman Sr. | Bradley Beal
Luka Doncic | Bradley Beal | Keldon Johnson | Kyle Anderson | Jonas Valanciunas | Tyus Jones | Xavier Tillman Sr. | Tim Hardaway Jr.

As you can see, the same players are in records 1 and 2, but listed at different positions. Likewise, the same players are in records 3 and 4, and again in records 5 and 6. Note: this is a simplified example; the real data has many more lineups with the same players. I need each unique set of players, regardless of position, to be represented by exactly one record. It doesn't matter whether the first or the last record with that combination of players is kept. So how do I reduce the DataFrame above to something like the DataFrame below? I'll also need to reset the index once the DataFrame is reduced.

PG | SG | SF | PF | C | G | F | UTIL
Luka Doncic | Tim Hardaway Jr. | Dillon Brooks | Keldon Johnson | Xavier Tillman Sr. | Tyus Jones | DeMar DeRozan | Bradley Beal
Tyus Jones | Bradley Beal | Keldon Johnson | DeMar DeRozan | Xavier Tillman Sr. | Tim Hardaway Jr. | Brandon Clarke | Luka Doncic
Luka Doncic | Tim Hardaway Jr. | Kyle Anderson | Keldon Johnson | Jonas Valanciunas | Tyus Jones | Xavier Tillman Sr. | Bradley Beal

Thank you very much in advance!

You can group by a set representation of each row, so that the order of the columns no longer matters, and then pick off the first or last row of each group:

In [15]: import pandas as pd

In [16]: df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 1, 2, 5, 4], 'c': [3, 3, 1, 6, 6]})

In [17]: df
Out[17]:
   a  b  c
0  1  2  3
1  2  1  3
2  3  2  1
3  4  5  6
4  5  4  6

In [18]: df.groupby(df.apply(lambda x: tuple(sorted(set(x))), axis=1)).first()
Out[18]:
           a  b  c
(1, 2, 3)  1  2  3
(4, 5, 6)  4  5  6

In [19]: df.groupby(df.apply(lambda x: tuple(sorted(set(x))), axis=1)).last()
Out[19]:
           a  b  c
(1, 2, 3)  3  2  1
(4, 5, 6)  5  4  6

You can also clear that index with .reset_index(drop=True) at the end.
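For the lineup data in the question, the same idea might look like the sketch below. The variable names (`positions`, `lineups`, `key`, `unique_lineups`, `unique_via_mask`) are just illustrative, and the rows are the six sample records from the question. It assumes a player never appears twice in the same lineup, so a sorted tuple of names works as an order-independent key. Sorting matters for strings, because iterating over a plain set is not guaranteed to give the same order for two rows holding the same names.

```python
import pandas as pd

# Position columns and sample rows copied from the question.
positions = ["PG", "SG", "SF", "PF", "C", "G", "F", "UTIL"]
lineups = pd.DataFrame(
    [
        ["Luka Doncic", "Tim Hardaway Jr.", "Dillon Brooks", "Keldon Johnson",
         "Xavier Tillman Sr.", "Tyus Jones", "DeMar DeRozan", "Bradley Beal"],
        ["Tyus Jones", "Dillon Brooks", "Tim Hardaway Jr.", "DeMar DeRozan",
         "Xavier Tillman Sr.", "Luka Doncic", "Keldon Johnson", "Bradley Beal"],
        ["Tyus Jones", "Bradley Beal", "Keldon Johnson", "DeMar DeRozan",
         "Xavier Tillman Sr.", "Tim Hardaway Jr.", "Brandon Clarke", "Luka Doncic"],
        ["Tyus Jones", "Tim Hardaway Jr.", "Keldon Johnson", "DeMar DeRozan",
         "Brandon Clarke", "Bradley Beal", "Xavier Tillman Sr.", "Luka Doncic"],
        ["Luka Doncic", "Tim Hardaway Jr.", "Kyle Anderson", "Keldon Johnson",
         "Jonas Valanciunas", "Tyus Jones", "Xavier Tillman Sr.", "Bradley Beal"],
        ["Luka Doncic", "Bradley Beal", "Keldon Johnson", "Kyle Anderson",
         "Jonas Valanciunas", "Tyus Jones", "Xavier Tillman Sr.", "Tim Hardaway Jr."],
    ],
    columns=positions,
)

# One key per row: the player names sorted alphabetically, so two rows with
# the same players get the same key no matter which position each one fills.
key = lineups.apply(lambda row: tuple(sorted(row)), axis=1)

# Keep the first record of each group; sort=False keeps the groups in order
# of first appearance, and reset_index(drop=True) gives a fresh 0..n-1 index.
unique_lineups = lineups.groupby(key, sort=False).first().reset_index(drop=True)

# The same reduction written as a boolean mask instead of a groupby.
unique_via_mask = lineups[~key.duplicated()].reset_index(drop=True)

print(unique_lineups)
```

Passing `keep="last"` to `key.duplicated()` would keep the last record of each group rather than the first.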
