简体   繁体   中英

How do I concisely replace column values given multiple conditions?

I'm trying to use numpy.select to replace string values within a column; if string contains a keyword, I need the whole string to be replaced with another keyword (there are +- 25 combinations).

df["new_col"] = np.select(
    condlist=[
        df["col"].str.contains("cat1", na=False, case=False),
        df["col"].str.contains("cat2", na=False, case=False),
        df["col"].str.contains("cat3", na=False, case=False),
        df["col"].str.contains("cat4", na=False, case=False),
        # ...
        df["col"].str.contains("cat25", na=False, case=False),
    ],
    choicelist=[
        "NEW_cat1",
        "NEW_cat2",
        "NEW_cat3",
        "NEW_cat4",
        # ...
        "NEW_cat25"
    ],
    default="DEFAULT_cat",
)

Is there a more concise way, or should I just repeat str.contains(...) within condlist 25 times?; is numpy.select the proper way to do it, at all?

I assume dict could be used here, but don't see how exactly.

df["col"].map(d) where d is a dict with old and new values like {"cat1":"NEW_cat1"} wouldn't work (?) since I can't hardcode exact values that need to be replaced (and that's why I'm using str.contains ).

Should be able to use str.extract and then map the matches.

Setup

import pandas as pd
import re

df = pd.DataFrame({'col': ['foo', 'foOBar', 'oRange', 'manGo', 'i LIKE PIZZA', 
                           'some sentence with foo', 'foo and PizzA']})

cat_list = ['foo', 'orange', 'pizza']  # all lower case
label_l = ['Label_foo', 'Label_orange', 'Label_pizza']

Code

patt = re.compile('('+'|'.join(cat_list)+')', re.IGNORECASE)

df['new_col'] = (df.col.str.extract(patt)[0]  # First label in str if multiple
                   .str.lower()
                   .map(dict(zip(cat_list, label_l)))
                   .fillna('DEFAULT_LABEL'))

                      col        new_col
0                     foo      Label_foo
1                  foOBar      Label_foo
2                  oRange   Label_orange
3                   manGo  DEFAULT_LABEL
4            i LIKE PIZZA    Label_pizza
5  some sentence with foo      Label_foo
6           foo and PizzA      Label_foo

If there's a chance for multiple matches and we need to implement a hierarchy where 'pizza' should be prioritized over 'foo' we can add a few more steps using an ordered category dtype.

cat_list = ['pizza', 'orange', 'foo']  # ordered in priority
label_l = ['Label_pizza', 'Label_orange', 'Label_foo']

my_cat = pd.api.types.CategoricalDtype(categories=cat_list, ordered=True)

s = (df.col.str.extractall(patt)[0]
       .str.lower()
       .astype(my_cat))

df['new_col'] = (s.to_frame().groupby(level=0).min()[0]  # min gets priority
                  .map(dict(zip(cat_list, label_l))))
df['new_col'] = df['new_col'].astype(str).replace('nan', 'DEFAULT_LABEL')
#                      col        new_col
#0                     foo      Label_foo
#1                  foOBar      Label_foo
#2                  oRange   Label_orange
#3                   manGo  DEFAULT_LABEL
#4            i LIKE PIZZA    Label_pizza
#5  some sentence with foo      Label_foo
#6           foo and PizzA    Label_pizza

The contents you are passing as the condlist and choicelist parameters are ordinary Python lists. List contents can be produced in a concise way in the language by using list comprehensions, that is the syntax: [expression_using_item for item in sequence]

In other words, your code can be written as:

df["new_col"] = np.select(
    condlist=[
       df["col"].str.contains(f"cat{i}", na=False, case=False) for i in range(1, 26)],        
    choicelist=[f"NEW_cat{i}" for i in range(1, 26)],
    default="DEFAULT_cat",
)

(and if the category names is not a numeric sequence, and you are giving these names here just as an example, you create a sequence (list) with all the explicit category names and plug that in place of the range() call in the snippet above)

Based off of this answer to a similar question , and this one , a simple solution:

import pandas as pd
import string

# Preparing test data
test_cont = [f"cat_{i}" for i in string.ascii_lowercase]
test_rep = [f"cat_{i}" for i in range(27)]

kv = zip(test_cont, test_rep)

test_df_data = zip(range(27), test_cont)

test_df = pd.DataFrame(data=test_df_data, columns=["some_col", "str_vals"])


# The solution itself
for (cont, rep) in kv:
    cont_mask = test_df["str_vals"].str.contains(cont, na=False, case=False)
    test_df.loc[cont_mask, "str_vals"] = rep

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM