I have a dataframe as follows:
name  teamA  teamB
foo   a      b
foo   b      c
foo   c      b
bar   a      e
bar   a      d
...
I want to find the intersection of the rows for each name separately, across both columns teamA and teamB, and then clear the cell that contains that intersection value. In this example, the intersection for name "foo" would be "b", and for "bar" it would be "a". After removing these intersection values, the data frame would look like:
name  teamA  teamB
foo   a      " "
foo   " "    c
foo   c      " "
bar   " "    e
bar   " "    d
...
Recently, I've tried combining teamA and teamB into one column named, for example, teams:
name  teams
foo   [a, b]
foo   [b, c]
foo   [c, b]
...
after which I would like to get:
name  teams
foo   [a, " "]
foo   [" ", c]
foo   [c, " "]
...
But I've found it is recommended to keep them as two separate columns instead, and I found an interesting answer, though I don't know how to apply it to a grouped data frame: https://stackoverflow.com/a/55554709/9168586 (see the "Filter on MANY Columns" section, "to retain rows where at least one column is True"). As in that example:
dataframe[['teamA', 'teamB']].isin('b').any(axis=1)

0    True
1    True
2    True
3    True
dtype: bool
where 'b' would be one of the values (teams) I iterate through. After each iteration, if the whole column is True, I would remove that value from teamA or teamB in every row and continue to the next group.
The errors that I get are:
Cannot access callable attribute 'isin' of 'DataFrameGroupBy' objects, try using the 'apply' method
and
only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'
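Both error messages point at concrete fixes: isin needs a list-like rather than a bare string, and a DataFrameGroupBy has no isin of its own, so the check has to run inside apply. A minimal sketch of both fixes on the question's sample data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"name": ["foo", "foo", "foo", "bar", "bar"],
                   "teamA": ["a", "b", "c", "a", "a"],
                   "teamB": ["b", "c", "b", "e", "d"]})

# Fix 1: isin expects a list-like, so wrap the single team in a list
mask = df[["teamA", "teamB"]].isin(["b"]).any(axis=1)

# Fix 2: group by an external key and call isin inside apply,
# so isin runs on each group's plain DataFrame, not on the GroupBy
per_group = (df[["teamA", "teamB"]]
             .groupby(df["name"], sort=False)
             .apply(lambda g: g.isin(["b"]).any(axis=1)))
```

Here mask is True for foo's three rows only, since "b" never appears in the bar rows.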
We can do melt, then drop the duplicates, and pivot it back:
s = (df.reset_index()
       .melt(['index', 'name'])
       .drop_duplicates(['name', 'value'], keep=False)
       .pivot_table(index=['index', 'name'], columns='variable',
                    values='value', aggfunc='first')
       .fillna('')
       .reset_index(level=1))
s['team'] = list(zip(s.teamA, s.teamB))
s
Out[102]:
variable name teamA teamB   team
index
0         foo     a         (a, )
1         foo           c   (, c)
2         foo           d   (, d)
3         bar           e   (, e)
4         bar           d   (, d)
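Run end-to-end, the chain reproduces that output on a sample frame matching it (the input data here is an assumption, reconstructed from the printed output, since the answer does not show it):

```python
import pandas as pd

# Assumed input, reconstructed from the answer's printed output
df = pd.DataFrame({"name": ["foo", "foo", "foo", "bar", "bar"],
                   "teamA": ["a", "b", "b", "a", "a"],
                   "teamB": ["b", "c", "d", "e", "d"]})

s = (df.reset_index()
       .melt(["index", "name"])                         # long form: one team per row
       .drop_duplicates(["name", "value"], keep=False)  # drop every repeated team per name
       .pivot_table(index=["index", "name"], columns="variable",
                    values="value", aggfunc="first")    # back to wide form
       .fillna("")
       .reset_index(level=1))
```

Note that keep=False removes every team repeated within a name, not only teams appearing in both columns; on the question's original data, where "c" occurs twice for foo, it would drop "c" as well.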
Try groupby and apply stack, drop_duplicates, unstack, fillna:
(df[['teamA', 'teamB']].groupby(df.name, sort=False)
.apply(lambda x: x.stack().drop_duplicates(keep=False))
.unstack().fillna('').reset_index('name'))
Out[93]:
  name teamA teamB
0  foo     a
1  foo           c
2  foo           d
3  bar           e
4  bar           d
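To see why this works: stacking a group folds both team columns into one Series, so drop_duplicates(keep=False) removes every team that occurs more than once anywhere in the group. A sketch on an assumed sample frame (the answer does not show its input, so the data here is reconstructed from the printed output):

```python
import pandas as pd

# Assumed input, reconstructed from the answer's printed output
df = pd.DataFrame({"name": ["foo", "foo", "foo", "bar", "bar"],
                   "teamA": ["a", "b", "b", "a", "a"],
                   "teamB": ["b", "c", "d", "e", "d"]})

# The stacked view of the "foo" group: both columns interleaved row by row,
# so keep=False then drops all three occurrences of "b"
stacked = df.loc[df["name"] == "foo", ["teamA", "teamB"]].stack()

out = (df[["teamA", "teamB"]]
       .groupby(df["name"], sort=False)
       .apply(lambda x: x.stack().drop_duplicates(keep=False))
       .unstack()
       .fillna("")
       .reset_index("name"))
```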
Maybe it is not as nice as @WeNYoBen's solution, but you could consider using a custom function, which is pretty flexible:
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["foo"] * 3 + ["bar"] * 2,
                   "teamA": ["a", "b", "b", "a", "a"],
                   "teamB": ["b", "c", "d", "e", "d"]})

def fun(x):
    # teams that appear in both columns within this group
    toRemove = list(set(x["teamA"].values).intersection(x["teamB"]))
    for col in ["teamA", "teamB"]:
        x[col] = np.where(x[col].isin(toRemove), " ", x[col])
    return x

df.groupby("name").apply(fun)
whose output is:

  name teamA teamB
0  foo     a
1  foo           c
2  foo           d
3  bar     a     e
4  bar     a     d
groupby.apply + Series.isin. Sample DataFrame:
print(df)

  name teamA teamB
0  foo     a     b
1  foo     b     c
2  foo     b     d
3  bar     a     e
4  bar     a     d
5  bar     b     a
new_df = df.copy()
groups = df.groupby('name', sort=False)
new_df['teamA'] = groups.apply(lambda x: x['teamA'].mask(x['teamA'].isin(x['teamB']), ' ')).reset_index(drop=True)
new_df['teamB'] = groups.apply(lambda x: x['teamB'].mask(x['teamB'].isin(x['teamA']), ' ')).reset_index(drop=True)
print(new_df)

  name teamA teamB
0  foo     a
1  foo           c
2  foo           d
3  bar           e
4  bar           d
5  bar     b
Then use DataFrame.apply + join and split to get the teams column:
new_df['teams']=new_df[['teamA','teamB']].apply(lambda x: ','.join(x).split(','),axis=1)
print(new_df)

  name teamA teamB   teams
0  foo     a         [a,  ]
1  foo           c   [ , c]
2  foo           d   [ , d]
3  bar           e   [ , e]
4  bar           d   [ , d]
5  bar     b         [b,  ]
After editing my question yesterday... This is my data frame (df):
name  teamA  teamB  year
foo   a      b      1
foo   b      c      1
foo   c      b      1
bar   a      e      2
bar   a      d      2
foo   a      h      2
foo   h      c      2
foo   h      b      2
...
This is the solution:
def fun(x):
    melted = pd.melt(x.reset_index(), id_vars=['name', 'year'],
                     value_vars=['teamA', 'teamB'],
                     var_name='var_name', value_name='team')
    toRemove = melted.team.mode().iloc[0]
    for col in ["teamA", "teamB"]:
        x[col] = x[col].replace(toRemove, 'something')
    return x

df = df.groupby(["name", "year"]).apply(fun)
So, I melt my dataframe and find the most frequent value, after which I remove that value from the two columns. Thanks @rpanai, every answer was helpful, but yours the most!
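A self-contained sketch of this solution, with two assumptions labeled here: the sample frame is trimmed to the first five rows of the edited question, the placeholder is " " rather than 'something', and an explicit loop over groups stands in for groupby.apply so the grouping columns stay available inside fun:

```python
import pandas as pd

# Trimmed sample from the edited question
df = pd.DataFrame({"name": ["foo", "foo", "foo", "bar", "bar"],
                   "teamA": ["a", "b", "c", "a", "a"],
                   "teamB": ["b", "c", "b", "e", "d"],
                   "year": [1, 1, 1, 2, 2]})

def fun(x):
    # Melt both team columns into one and take the most frequent team
    melted = pd.melt(x.reset_index(), id_vars=["name", "year"],
                     value_vars=["teamA", "teamB"], value_name="team")
    to_remove = melted["team"].mode().iloc[0]
    for col in ["teamA", "teamB"]:
        x[col] = x[col].replace(to_remove, " ")  # " " instead of 'something'
    return x

# Explicit loop instead of groupby.apply: fun still sees name/year columns
out = pd.concat(fun(g.copy()) for _, g in df.groupby(["name", "year"])).sort_index()
```

For the foo/year-1 group the melted teams are [a, b, c, b, c, b], so the mode is "b" and both columns have their "b" cells blanked, matching the desired output in the question.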