I'm trying to use numpy.select
to replace string values within a column; if string contains a keyword, I need the whole string to be replaced with another keyword (there are +- 25 combinations).
df["new_col"] = np.select(
condlist=[
df["col"].str.contains("cat1", na=False, case=False),
df["col"].str.contains("cat2", na=False, case=False),
df["col"].str.contains("cat3", na=False, case=False),
df["col"].str.contains("cat4", na=False, case=False),
# ...
df["col"].str.contains("cat25", na=False, case=False),
],
choicelist=[
"NEW_cat1",
"NEW_cat2",
"NEW_cat3",
"NEW_cat4",
# ...
"NEW_cat25"
],
default="DEFAULT_cat",
)
Is there a more concise way, or should I just repeat str.contains(...)
within condlist
25 times?; is numpy.select
the proper way to do it, at all?
I assume dict
could be used here, but don't see how exactly.
df["col"].map(d)
where d
is a dict with old and new values like {"cat1":"NEW_cat1"}
wouldn't work (?) since I can't hardcode exact values that need to be replaced (and that's why I'm using str.contains
).
Should be able to use str.extract
and then map the matches.
import pandas as pd
import re
df = pd.DataFrame({'col': ['foo', 'foOBar', 'oRange', 'manGo', 'i LIKE PIZZA',
'some sentence with foo', 'foo and PizzA']})
cat_list = ['foo', 'orange', 'pizza'] # all lower case
label_l = ['Label_foo', 'Label_orange', 'Label_pizza']
patt = re.compile('('+'|'.join(cat_list)+')', re.IGNORECASE)
df['new_col'] = (df.col.str.extract(patt)[0] # First label in str if multiple
.str.lower()
.map(dict(zip(cat_list, label_l)))
.fillna('DEFAULT_LABEL'))
col new_col
0 foo Label_foo
1 foOBar Label_foo
2 oRange Label_orange
3 manGo DEFAULT_LABEL
4 i LIKE PIZZA Label_pizza
5 some sentence with foo Label_foo
6 foo and PizzA Label_foo
If there's a chance for multiple matches and we need to implement a hierarchy where 'pizza' should be prioritized over 'foo' we can add a few more steps using an ordered category dtype.
cat_list = ['pizza', 'orange', 'foo'] # ordered in priority
label_l = ['Label_pizza', 'Label_orange', 'Label_foo']
my_cat = pd.api.types.CategoricalDtype(categories=cat_list, ordered=True)
s = (df.col.str.extractall(patt)[0]
.str.lower()
.astype(my_cat))
df['new_col'] = (s.to_frame().groupby(level=0).min()[0] # min gets priority
.map(dict(zip(cat_list, label_l))))
df['new_col'] = df['new_col'].astype(str).replace('nan', 'DEFAULT_LABEL')
# col new_col
#0 foo Label_foo
#1 foOBar Label_foo
#2 oRange Label_orange
#3 manGo DEFAULT_LABEL
#4 i LIKE PIZZA Label_pizza
#5 some sentence with foo Label_foo
#6 foo and PizzA Label_pizza
The contents you are passing as the condlist
and choicelist
parameters are ordinary Python lists. List contents can be produced in a concise way in the language by using list comprehensions, that is the syntax: [expression_using_item for item in sequence]
In other words, your code can be written as:
df["new_col"] = np.select(
condlist=[
df["col"].str.contains(f"cat{i}", na=False, case=False) for i in range(1, 26)],
choicelist=[f"NEW_cat{i}" for i in range(1, 26)],
default="DEFAULT_cat",
)
(and if the category names is not a numeric sequence, and you are giving these names here just as an example, you create a sequence (list) with all the explicit category names and plug that in place of the range()
call in the snippet above)
Based off of this answer to a similar question , and this one , a simple solution:
import pandas as pd
import string
# Preparing test data
test_cont = [f"cat_{i}" for i in string.ascii_lowercase]
test_rep = [f"cat_{i}" for i in range(27)]
kv = zip(test_cont, test_rep)
test_df_data = zip(range(27), test_cont)
test_df = pd.DataFrame(data=test_df_data, columns=["some_col", "str_vals"])
# The solution itself
for (cont, rep) in kv:
cont_mask = test_df["str_vals"].str.contains(cont, na=False, case=False)
test_df.loc[cont_mask, "str_vals"] = rep
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.