Pandas: how to move rows with a duplicated value in one column into a new DataFrame when the value in another column changes?

Question

I have a Pandas DataFrame looking like this, df:

text label
a    country
a    sport
b    cooking
b    cooking
c    travel
c    design
d    tech

I would like to split it into two DataFrames: one containing the rows whose 'text' value is duplicated and whose 'label' value changes, and another containing everything else.

Expected outputs, df1:

text label
a    country
a    sport
c    travel
c    design

And df2:

text label
b    cooking
b    cooking
d    tech

Answer 1

Use DataFrame.duplicated to build boolean masks, testing one column or several:

m1 = df.duplicated('text', keep=False)
m2 = df.duplicated(['text','label'], keep=False)
#if all columns
#m2 = df.duplicated(keep=False)
mask = m2 | ~m1

df1 = df[~mask]
df2 = df[mask]

print (df1)
  text    label
0    a  country
1    a    sport
4    c   travel
5    c   design

print (df2)
  text    label
2    b  cooking
3    b  cooking
6    d     tech
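
To see how the two masks combine, here is a minimal self-contained sketch. It rebuilds the question's sample frame (constructing it from a dict with pd.DataFrame is an assumption about how df was created) and checks which index labels land in each result.

```python
import pandas as pd

# Sample data from the question; building it from a dict is an assumption.
df = pd.DataFrame({
    "text": ["a", "a", "b", "b", "c", "c", "d"],
    "label": ["country", "sport", "cooking", "cooking",
              "travel", "design", "tech"],
})

# True where the `text` value occurs more than once.
m1 = df.duplicated("text", keep=False)
# True where the whole (text, label) pair occurs more than once.
m2 = df.duplicated(["text", "label"], keep=False)

# A row goes to df2 if its pair repeats (label unchanged)
# or if its text is not duplicated at all.
mask = m2 | ~m1

df1 = df[~mask]  # duplicated text whose label differs
df2 = df[mask]   # everything else

print(df1.index.tolist())  # [0, 1, 4, 5]
print(df2.index.tolist())  # [2, 3, 6]
```

keep=False matters here: with the default keep='first', the first row of each duplicated group would not be flagged.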

Another approach is to count the unique 'label' values per 'text' group and test whether that count equals 1:

mask = df.groupby('text')['label'].transform('nunique').eq(1)
df1 = df[~mask]
df2 = df[mask]
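
As a rough illustration of what transform('nunique') does, the sketch below shows the intermediate per-row counts on the question's sample data. The frame construction and the name n_labels are assumptions added for the example.

```python
import pandas as pd

# Sample data from the question; building it from a dict is an assumption.
df = pd.DataFrame({
    "text": ["a", "a", "b", "b", "c", "c", "d"],
    "label": ["country", "sport", "cooking", "cooking",
              "travel", "design", "tech"],
})

# Count distinct labels within each `text` group; transform repeats the
# group's count on every row of that group, so the result aligns with df.
n_labels = df.groupby("text")["label"].transform("nunique")
print(n_labels.tolist())  # [2, 2, 1, 1, 2, 2, 1]

# Groups with exactly one distinct label have no label change.
mask = n_labels.eq(1)
df1 = df[~mask]
df2 = df[mask]

print(df1.index.tolist())  # [0, 1, 4, 5]
print(df2.index.tolist())  # [2, 3, 6]
```

Because the test is made per group, every row of a 'text' group always ends up in the same output frame.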

The two approaches give different output when a group contains both changed and repeated labels, as in this modified data:

print (df)
  text    label
0    a  country
1    a    sport
2    a    sport
3    b  cooking
4    b  cooking
5    c   travel
6    c   design
7    d     tech
    

m1 = df.duplicated('text', keep=False)
m2 = df.duplicated(['text','label'], keep=False)
#if all columns
#m2 = df.duplicated(keep=False)
mask = m2 | ~m1

df1 = df[~mask]
df2 = df[mask]
print (df1)
  text    label
0    a  country
5    c   travel
6    c   design

print (df2)
  text    label
1    a    sport
2    a    sport
3    b  cooking
4    b  cooking
7    d     tech

The groupby approach on the same data keeps the whole 'a' group in df1:

mask = df.groupby('text')['label'].transform('nunique').eq(1)
df1 = df[~mask]
df2 = df[mask]
print (df1)
  text    label
0    a  country
1    a    sport
2    a    sport
5    c   travel
6    c   design

print (df2)
  text    label
3    b  cooking
4    b  cooking
7    d     tech
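
To pin down exactly where the two approaches disagree on the modified data, one option is to compute both masks and compare them row by row. This is only a sketch; the names mask_dup, mask_grp and disagree are made up for the example.

```python
import pandas as pd

# Modified sample data with an extra ('a', 'sport') row.
df = pd.DataFrame({
    "text": ["a", "a", "a", "b", "b", "c", "c", "d"],
    "label": ["country", "sport", "sport", "cooking", "cooking",
              "travel", "design", "tech"],
})

# Approach 1: row-level test based on duplicated().
m1 = df.duplicated("text", keep=False)
m2 = df.duplicated(["text", "label"], keep=False)
mask_dup = m2 | ~m1

# Approach 2: group-level test based on the number of distinct labels.
mask_grp = df.groupby("text")["label"].transform("nunique").eq(1)

# Rows that the two approaches send to different output frames.
disagree = df[mask_dup != mask_grp]
print(disagree.index.tolist())  # [1, 2]
```

Rows 1 and 2 are the repeated ('a', 'sport') pair: the duplicated() masks treat them as exact duplicates and move them to df2, while the groupby test sees that group 'a' has two distinct labels and keeps the whole group in df1.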

Answer 2

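# rows whose `text` value occurs more than once
dup_text = df.duplicated('text', keep=False)
# rows whose (`text`, `label`) pair occurs more than once, i.e. the label did not change
dup_pair = df.duplicated(['text', 'label'], keep=False)
# index of rows with a duplicated `text` and a changing `label`
duplicated_index = df.index[dup_text & ~dup_pair]

# select df1 and df2 according to this index
df1 = df.loc[duplicated_index].reset_index(drop=True)
df2 = df.loc[df.index.difference(duplicated_index)].reset_index(drop=True)

# we get
df1
  text    label
0    a  country
1    a    sport
2    c   travel
3    c   design

df2
  text    label
0    b  cooking
1    b  cooking
2    d     tech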

2 answers

solution1
2 ACCEPTED 2021-10-01 10:03:24

solution2
1 2021-10-01 09:52:51
