I'm trying to match each value in a DataFrame column to one of a list of substrings.
For example, take the column (strings) with the following values:
text1C1
text2A
text2
text4
text4B
text4A3
I want to create a new column that maps each value to one of the following substrings:
vals = ['text1', 'text2', 'text3', 'text4', 'text4B']
The code I have at the moment works, but it seems like a really inefficient way of solving the problem.
import pandas as pd

df = pd.DataFrame({'strings': ['text1C1', 'text2A', 'text2', 'text4', 'text4B', 'text4A3']})
for v in vals:
    df.loc[df[df['strings'].str.contains(v)].index, 'matched strings'] = v
This returns the following DataFrame, which is what I need.
   strings matched strings
0  text1C1           text1
1   text2A           text2
2    text2           text2
3    text4           text4
4   text4B          text4B
5  text4A3           text4
Is there a more efficient way of doing this, especially for larger DataFrames (10k+ rows)?
I can't think of how to deal with one of the items of vals also being a substring of another (text4 is a substring of text4B).
Use a generator expression with next to return the first matching value. The list is reversed so that text4B is checked before text4:
s = vals[::-1]
df['matched strings1'] = df['strings'].apply(lambda x: next(y for y in s if y in x))
print(df)
   strings matched strings matched strings1
0  text1C1           text1            text1
1   text2A           text2            text2
2    text2           text2            text2
3    text4           text4            text4
4   text4B          text4B           text4B
5  text4A3           text4            text4
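To see why the list is reversed, the following minimal sketch compares the first match found in the original order with the first match found in the reversed order for a single value. The variable names forward and backward are only illustrative.

```python
vals = ['text1', 'text2', 'text3', 'text4', 'text4B']
text = 'text4B'

# In the original order, 'text4' is reached first and wins.
forward = next(y for y in vals if y in text)

# In the reversed order, 'text4B' is reached first and wins.
backward = next(y for y in vals[::-1] if y in text)

print(forward, backward)
```

This only works because text4B appears after text4 in vals. If the list order is not under your control, the candidates need to be ordered some other way, for example by length.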
A more general solution, for when some rows may have no match, uses iter and the default parameter of next:
f = lambda x: next(iter(y for y in s if y in x), 'no match')
df['matched strings1'] = df['strings'].apply(f)
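As a rough sketch of how the default behaves, the example below adds one row ('other') that matches nothing. It also sorts the candidates by length instead of reversing the list, which is an assumption on my part that the longest match is always the one wanted. The helper name first_match is hypothetical.

```python
import pandas as pd

vals = ['text1', 'text2', 'text3', 'text4', 'text4B']
df = pd.DataFrame(
    {'strings': ['text1C1', 'text2A', 'text2', 'text4', 'text4B', 'text4A3', 'other']}
)

# Assumption: longer candidates should win, so check them first.
candidates = sorted(vals, key=len, reverse=True)

def first_match(text, default='no match'):
    # Hypothetical helper: return the first candidate contained in text,
    # or the default when none of them is.
    return next((c for c in candidates if c in text), default)

df['matched strings1'] = df['strings'].apply(first_match)
print(df)
```

Without the default, next would raise StopIteration on the 'other' row.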
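Your own solution can also be improved by indexing with the boolean mask directly and passing regex=False, so each value is treated as a plain substring:
for v in vals:
    df.loc[df['strings'].str.contains(v, regex=False), 'matched strings'] = v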
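Another option that avoids the Python-level loop over vals is a single regular expression passed to str.extract. This is a different technique from the loop above, not a drop-in replacement: it returns the leftmost match in each string, and it assumes that longer candidates should be preferred, so the results could differ if several values match at different positions. The names ordered and pattern are illustrative.

```python
import re
import pandas as pd

vals = ['text1', 'text2', 'text3', 'text4', 'text4B']
df = pd.DataFrame({'strings': ['text1C1', 'text2A', 'text2', 'text4', 'text4B', 'text4A3']})

# Put longer candidates first so 'text4B' is tried before 'text4'.
ordered = sorted(vals, key=len, reverse=True)

# Escape each value so it is matched literally, then join into one capture group.
pattern = '(' + '|'.join(re.escape(v) for v in ordered) + ')'

# expand=False returns a Series; rows with no match become NaN.
df['matched strings'] = df['strings'].str.extract(pattern, expand=False)
print(df)
```

Whether this is actually faster on 10k+ rows is worth measuring on your own data.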