I'm trying to write a script that checks that no value in one DataFrame column is a substring of another value in that column, among rows that share the same value in a different column. I wrote code that iterates with iterrows and collects, for each row, the other rows whose values contain it. An example:
import pandas as pd
df = pd.DataFrame({'names': ['Bob', 'Sam', 'Tom', 'Bob'], 'value': ['abc', 'ab', 'de', 'ab']})
>>> df
  names value
0   Bob   abc
1   Sam    ab
2   Tom    de
3   Bob    ab
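substring_df = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
    value = row["value"]
    name = row["names"]
    # Parenthesize the comparison: & binds more tightly than ==
    # regex=False matches value as a literal substring rather than a regex pattern
    delta = df[df['value'].str.contains(value, regex=False) & (df['names'] == name)]
    if len(delta.index) > 1:
        substring_df = pd.concat([substring_df, delta])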
>>> substring_df
  names value
0   Bob   abc
3   Bob    ab
This code works, but it is very slow on large amounts of data. Running it on a DataFrame with 10,000 rows took 2 minutes, and I need to run it on even larger data.
Any ideas on how to make this code more efficient?
Use GroupBy.transform with a list comprehension that collects, within each group, the values that contain or are contained in a different value (tested with in), then filter the rows by boolean indexing:
df = pd.DataFrame({"names": ["Bob", "Bob", "Bob", "Alice"], "value": ["abc", "ab", "d", "a"]})
print(df)
   names value
0    Bob   abc
1    Bob    ab
2    Bob     d
3  Alice     a
# For each group, flag every value that contains, or is contained in, a different value
f = lambda x: x.isin([w for y in x for z in x if z != y and z in y for w in (z, y)])
df = df[df.groupby('names')['value'].transform(f)]
print(df)
  names value
0   Bob   abc
1   Bob    ab
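For readability, the same idea can be sketched as a named function and run on the data from the question. This is only one possible way to write it: the function name rows_with_substring_relation and its parameters are made up for the illustration and are not part of pandas. It also compares only the unique values of each group, which is an added assumption meant to reduce the number of pairwise checks.

```python
import pandas as pd


def rows_with_substring_relation(df, group_col="names", value_col="value"):
    """Keep rows whose value contains, or is contained in, a different
    value of the same group. Hypothetical helper, not a pandas API."""

    def flag(values):
        # Compare unique values only; the pairwise check is quadratic
        # in the number of distinct values per group.
        uniq = values.unique().tolist()
        related = set()
        for short in uniq:
            for long_ in uniq:
                if short != long_ and short in long_:
                    related.update((short, long_))
        # Boolean Series aligned with the rows of this group
        return values.isin(list(related))

    mask = df.groupby(group_col)[value_col].transform(flag)
    return df[mask.astype(bool)]


df = pd.DataFrame({"names": ["Bob", "Sam", "Tom", "Bob"],
                   "value": ["abc", "ab", "de", "ab"]})
result = rows_with_substring_relation(df)
print(result)
```

On this data the function keeps rows 0 and 3, the same rows as the loop in the question. One difference to be aware of: because of the short != long_ test (z != y in the lambda), two identical values in the same group are not flagged, whereas the loop with str.contains would also match equal strings.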