
Pandas: drop rows based on duplicated values in a list

I would like to drop rows from my dataframe when a piece of a string is duplicated within that string. For example, if the string is jkl-ghi-jkl, I would drop this row because jkl appears twice. I figured that creating a list and checking that list for duplicates would be the ideal approach.

My dataframe for this example consists of one column and two data points:

    df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})

My first step is to split my data on "-":

    List = df1['Col1'].str.split('-')
    List

Which yields the output:

    0     [abc, def, ghi, jkl]
    1     [jkl, ghi, jkl, mno]
    Name: Col1, dtype: object

My second step is to convert my output into a list:

    List = List.tolist()

Which yields:

    [['abc', 'def', 'ghi', 'jkl'], ['jkl', 'ghi', 'jkl', 'mno']]

My last step is to compare the full list against a set of its unique values:

    len(List) > len(set(List))

Which yields the error:

    TypeError: unhashable type: 'list'

I am aware that my .tolist() call creates a list of two lists. Is there a way to convert these inner lists so that I can test them for duplicates? I wish to use this piece of code:

    len(List) > len(set(List))

with a drop, in order to remove every row whose cell contains a duplicated value.
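
As a small illustration of where the error comes from (a minimal sketch, not my real data): set() needs hashable elements, and the inner objects here are lists, which are not hashable, whereas tuples are:

    set([['abc'], ['abc']])        # TypeError: unhashable type: 'list'
    set([('abc',), ('abc',)])      # fine, tuples are hashable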

Is this the correct approach, or is there a simpler way?

My end output should look like:

     Col1
     abc-def-ghi-jkl

This is because the string jkl-ghi-jkl-mno gets dropped, since "jkl" appears twice.

You can combine str.split with duplicated to get a Boolean indexer:

    # Get a Boolean indexer for duplicates.
    dupe_rows = df1['Col1'].str.split('-', expand=True)
    dupe_rows = dupe_rows.apply(lambda row: row.duplicated().any(), axis=1)

    # Remove the duplicates.
    df1 = df1[~dupe_rows]
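
On the example frame above, dupe_rows should come out as [False, True], so only the first row survives:

                  Col1
    0  abc-def-ghi-jkl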

Edit

Another option is to use toolz.isdistinct in a similar manner to the other answers:

    import toolz

    df1[df1.Col1.str.split('-').apply(toolz.isdistinct)]
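
Note that toolz.isdistinct returns True when every piece in the sequence is unique, so the resulting Boolean Series can be used to index df1 directly, without negating it.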

Here is another option, using set and len:

    df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})

    df1['length'] = df1['Col1'].str.split('-').apply(set).apply(len)

    print(df1)

                  Col1  length
    0  abc-def-ghi-jkl       4
    1  jkl-ghi-jkl-mno       3

Then keep the rows where all four pieces are unique:

    df1 = df1.loc[df1['length'] == 4]

    print(df1)

                  Col1  length
    0  abc-def-ghi-jkl       4
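
If the strings will not always have exactly four pieces, the hard-coded 4 can be replaced by the actual number of pieces in each row; a minimal sketch of that variant, reusing the 'length' column from above:

    # Total number of pieces per row, duplicates included.
    n_pieces = df1['Col1'].str.split('-').apply(len)
    # Keep only the rows where every piece is unique.
    df1 = df1.loc[df1['length'] == n_pieces]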

Split 'Col1' and apply a repeat checker that uses a numpy pairwise comparison:

    import numpy as np

    def nerpt(lst):
        # Index pairs for every combination of two positions in the list.
        ti = np.triu_indices(len(lst), 1)
        a = np.array(lst)
        # True if any two pieces are identical.
        return (a[ti[0]] == a[ti[1]]).any()

    df1[~df1.Col1.str.split('-').apply(nerpt)]



Timings

It is pretty clear that using set is the most efficient. This reflects @Luis's answer.

Using pd.concat([df1 for _ in range(10000)]):

    rpt1 = lambda lst: not pd.Index(lst).is_unique
    rpt2 = lambda lst: len(lst) != len(set(lst))
    rpt3 = nerpt
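
The exact benchmark setup is not shown here, so as a rough sketch only (assuming pandas is imported as pd and that df1, nerpt and the lambdas above are already defined), the three checkers could be timed like this:

    import timeit

    big = pd.concat([df1 for _ in range(10000)])
    pieces = big['Col1'].str.split('-')

    for name, fn in [('rpt1', rpt1), ('rpt2', rpt2), ('rpt3', rpt3)]:
        # Time how long each checker takes to scan every row, 10 runs each.
        print(name, timeit.timeit(lambda: pieces.apply(fn), number=10))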


I went the same route you did, but kept everything in one dataframe; I used apply() and indexing to get what I needed:

    [in]:
    gf1 = df1.copy()
    gf1['Col2'] = gf1['Col1'].str.split('-')   # keep the lists in the same DF
    # True when the number of unique pieces equals the total number of pieces
    gf1['Col3'] = gf1['Col2'].apply(set).apply(len) == gf1['Col2'].apply(len)
    df1 = gf1['Col1'].loc[gf1['Col3']]
    df1
    [Out]:
    0    abc-def-ghi-jkl
    Name: Col1, dtype: object
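
The same idea can also be written without the helper columns; a minimal sketch, assuming df1 is still the original one-column frame:

    pieces = df1['Col1'].str.split('-')
    # Keep the rows whose pieces are all unique.
    df1 = df1[pieces.apply(lambda p: len(set(p)) == len(p))]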
