Pandas - Check if a value in a column is a substring of another value in the same column

Question

I'm trying to write a script that checks, for one column of a DataFrame, that no value is a substring of another value among rows that share the same value in a different column. I wrote code that iterates with iterrows and collects, for each row, the matching rows whose value contains it. An example:

import pandas as pd

df = pd.DataFrame({'names': ['Bob', 'Sam', 'Tom', 'Bob'], 'value': ['abc', 'ab', 'de', 'ab']})
>>> df
  names value
0   Bob   abc
1   Sam    ab
2   Tom    de
3   Bob    ab

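substring_df = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
    value = row["value"]
    name = row["names"]
    # Parenthesize the comparison: & binds tighter than ==.
    # regex=False matches value literally instead of as a regular expression.
    delta = df[df['value'].str.contains(value, regex=False) & (df['names'] == name)]
    if len(delta.index) > 1:
        substring_df = pd.concat([substring_df, delta])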
>>> substring_df
  names value
0   Bob   abc
3   Bob    ab

This code works fine, but it is very slow on large data: running it on a DataFrame with 10,000 rows took 2 minutes, and I need to run it on even bigger data.

Any ideas on how to make this code more efficient?

Answer 1

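Use GroupBy.transform with a list comprehension that uses the in operator to find, within each group, the values that are substrings of another value (together with the values that contain them), then filter the rows by boolean indexing: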

df = pd.DataFrame({"names": ["Bob", "Bob", "Bob", "Alice"], "value": ["abc", "ab", "d", "a"]})
print(df)
   names value
0    Bob   abc
1    Bob    ab
2    Bob     d
3  Alice     a

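# Within a group, collect every pair (z, y) where z is a substring of a different value y,
# then mark the values that appear in any such pair.
f = lambda x: x.isin([w for y in x for z in x if z != y and z in y for w in (z, y)])

# Keep only the rows whose value takes part in a substring relation within its names group.
df = df[df.groupby('names')['value'].transform(f)]
print(df)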
  names value
0   Bob   abc
1   Bob    ab
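
For comparison with the loop in the question, here is a minimal sketch of the same idea applied to the question's own DataFrame. The helper name flag_substring_values is made up for this example and is not part of pandas. The comparison is still quadratic inside each group, so the gain presumably comes from only comparing values that share a name rather than scanning the whole column for every row.

```python
import pandas as pd


def flag_substring_values(values: pd.Series) -> pd.Series:
    """Hypothetical helper: True where a value contains, or is contained in,
    a different value of the same group."""
    items = values.tolist()
    involved = set()
    for shorter in items:
        for longer in items:
            # Plain `in` is a literal substring test (no regex involved).
            if shorter != longer and shorter in longer:
                involved.update((shorter, longer))
    return values.isin(list(involved))


# Data from the question.
df = pd.DataFrame({"names": ["Bob", "Sam", "Tom", "Bob"],
                   "value": ["abc", "ab", "de", "ab"]})

# One boolean per row, computed group by group; astype(bool) guards against
# an object-dtype result.
mask = df.groupby("names")["value"].transform(flag_substring_values).astype(bool)
substring_df = df[mask]
print(substring_df)
```

On this data the mask selects rows 0 and 3 (Bob/abc and Bob/ab), the same rows the loop in the question returns. Note that identical values within a group are not flagged, because of the shorter != longer check.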

Answer 1: accepted, 2019-11-20 09:56:46
