
How to find the intersection of two columns of a grouped data frame and remove that value from the cells that contain it?

I have a dataframe as follows:

name  teamA   teamB
foo    a        b
foo    b        c
foo    c        b
bar    a        e
bar    a        d
...

I want to find, for each name separately, the value shared across the rows of both columns teamA and teamB, and then blank out each cell that contains it. In this example, the shared value for name "foo" is "b", and for "bar" it is "a". So the data frame after removing these values would look like:

name  teamA   teamB
foo     a      " "
foo    " "      c
foo     c      " "
bar    " "      e
bar    " "      d
...
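To make the goal concrete, here is a minimal runnable sketch of the transformation I am after, using the example data above. The blanking rule here, "remove the group's single most frequent value across both team columns", is one reading of the requirement that reproduces the expected output:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "name":  ["foo", "foo", "foo", "bar", "bar"],
    "teamA": ["a",   "b",   "c",   "a",   "a"],
    "teamB": ["b",   "c",   "b",   "e",   "d"],
})

def blank_most_common(group):
    # Find the single most frequent value across both team columns
    # of this group, then blank every cell holding it.
    top = group.stack().mode().iloc[0]
    return group.replace(top, " ")

df[["teamA", "teamB"]] = (
    df.groupby("name", sort=False, group_keys=False)[["teamA", "teamB"]]
      .apply(blank_most_common)
)
print(df)
```

For "foo" the most frequent value is "b" (three occurrences) and for "bar" it is "a" (two occurrences), so the printed frame matches the expected output above.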

Recently, I've also tried combining teamA and teamB into a single column, named e.g. teams:

name   teams
foo    [a, b]
foo    [b, c]
foo    [c, b]
...

after which I would like to get:

name   teams
foo    [a, " "]
foo    [" ", c]
foo    [c, " "]
...

But I've read that it is recommended to keep them as two separate columns, and I found an interesting answer, though I don't know how to apply it to a grouped data frame: https://stackoverflow.com/a/55554709/9168586 (see the "Filter on MANY Columns" section, "to retain rows where at least one column is True"). As in that example:

dataframe[['teamA', 'teamB']].isin('b').any(axis=1)

0     True
1     True
2     True
3     True
dtype: bool

where 'b' would be one of the team values I iterate over. After each iteration, if the whole column is True, I would remove that value from teamA or teamB in every row and continue to the next group.

The errors I get are:

Cannot access callable attribute 'isin' of 'DataFrameGroupBy' objects, try using the 'apply' method

and

only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'
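Edit: as far as I can tell, the two errors come from passing a bare string to isin() and from calling isin() on the GroupBy object directly. A sketch of calls that do run (column names as in my example):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["foo", "foo", "foo", "bar", "bar"],
    "teamA": ["a", "b", "c", "a", "a"],
    "teamB": ["b", "c", "b", "e", "d"],
})

# Fix for the second error: isin() needs a list-like, not a bare string.
mask = df[["teamA", "teamB"]].isin(["b"]).any(axis=1)

# Fix for the first error: a DataFrameGroupBy has no isin(); select the
# columns and run the check inside apply, one result per name -- here,
# whether "b" appears in every row of the group.
per_group = (
    df.groupby("name", sort=False)[["teamA", "teamB"]]
      .apply(lambda g: bool(g.isin(["b"]).any(axis=1).all()))
)
print(mask)
print(per_group)
```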

We can melt, then drop the duplicates, and pivot it back:

s = (df.reset_index()
       .melt(['index', 'name'])
       .drop_duplicates(['name', 'value'], keep=False)
       .pivot_table(index=['index', 'name'], columns='variable',
                    values='value', aggfunc='first')
       .fillna('')
       .reset_index(level=1))
s['team'] = list(zip(s.teamA, s.teamB))
s
Out[102]: 
variable name teamA teamB   team
index                           
0         foo     a        (a, )
1         foo           c  (, c)
2         foo           d  (, d)
3         bar           e  (, e)
4         bar           d  (, d)

Try groupby and apply with stack, drop_duplicates, unstack, and fillna:

(df[['teamA', 'teamB']].groupby(df.name, sort=False)
                       .apply(lambda x: x.stack().drop_duplicates(keep=False))
                       .unstack().fillna('').reset_index('name'))

Out[93]:
  name teamA teamB
0  foo     a
1  foo           c
2  foo           d
3  bar           e
4  bar           d
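One detail worth noting in this approach: keep=False makes drop_duplicates drop every occurrence of a repeated value, not just the later ones, which is exactly what removes the shared team from both columns. A tiny illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", "b", "b", "c", "d"])

# keep=False drops ALL occurrences of any duplicated value,
# leaving only the values that appear exactly once.
print(s.drop_duplicates(keep=False).tolist())  # ['a', 'c', 'd']
```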

Maybe it is not as nice as @WeNYoBen's solution, but you could consider using a custom function, which is pretty flexible:

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["foo"]*3 + ["bar"]*2,
                   "teamA": ["a", "b", "b", "a", "a"],
                   "teamB": ["b", "c", "d", "e", "d"]})


def fun(x):
    # values present in both teamA and teamB within this group
    toRemove = list(set(x["teamA"].values).intersection(x["teamB"]))
    for col in ["teamA", "teamB"]:
        x[col] = np.where(x[col].isin(toRemove), " ", x[col])
    return x


df.groupby("name").apply(fun)

whose output is:

  name teamA teamB
0  foo     a      
1  foo           c
2  foo           d
3  bar     a     e
4  bar     a     d

Use groupby.apply + Series.isin.

Sample DataFrame:

print(df)

  name teamA teamB
0  foo     a     b
1  foo     b     c
2  foo     b     d
3  bar     a     e
4  bar     a     d
5  bar     b     a

new_df=df.copy()
groups=df.groupby('name',sort=False)
new_df['teamA']=groups.apply(lambda x: x['teamA'].mask(x['teamA'].isin(x['teamB']),' ')).reset_index(drop=True)
new_df['teamB']=groups.apply(lambda x: x['teamB'].mask(x['teamB'].isin(x['teamA']),' ')).reset_index(drop=True)
print(new_df)

  name teamA teamB
0  foo     a      
1  foo           c
2  foo           d
3  bar           e
4  bar           d
5  bar     b   

Then use DataFrame.apply with join and split to build the teams column:

new_df['teams']=new_df[['teamA','teamB']].apply(lambda x: ','.join(x).split(','),axis=1)
print(new_df)

  name teamA teamB   teams
0  foo     a        [a,  ]
1  foo           c  [ , c]
2  foo           d  [ , d]
3  bar           e  [ , e]
4  bar           d  [ , d]
5  bar     b        [b,  ]

After I edited my question yesterday... This is my data frame (df):

name  teamA   teamB year
foo    a        b    1
foo    b        c    1
foo    c        b    1
bar    a        e    2
bar    a        d    2
foo    a        h    2
foo    h        c    2
foo    h        b    2
...

This is the solution:

def fun(x):
    # melt both team columns into one, then take the most frequent team
    melted = pd.melt(x.reset_index(), id_vars=['name', 'year'],
                     value_vars=['teamA', 'teamB'], var_name='var_name',
                     value_name='team')
    toRemove = melted.team.mode().iloc[0]
    # replace that team with a placeholder in both columns
    for col in ["teamA", "teamB"]:
        x[col] = x[col].replace(toRemove, 'something')
    return x


df = df.groupby(["name", "year"]).apply(fun)

So, I melt my data frame, find the most frequent value, and then remove that value from the two columns. Thanks @rpanai, every answer was helpful, but yours was the most!
