
How to conditionally modify string values in dataframe column - Python/Pandas

I have a dataframe in which one column ('entity') contains various names of countries and non-state entities. I need to clean the column because the string values (provided by manual data entry) are all lower-case (china instead of China). I can't just perform the .title() operation on the column, since there are string values that I want left untouched (e.g., al Something should not be turned into Al Something).

I'm having trouble creating a function to help me with this problem and could use some guidance from the community. In the past I've used dictionaries to map incorrect strings to correct strings, and I can still fall back on that approach, but I thought writing this function might be more straightforward and efficient, and I wanted to challenge myself. However, nothing changes in the entity column when I execute the function. Thanks in advance!

myString = ['al Group1', 'al Group2']

entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(myString)
        else:
           new_title.append(entity.title())
        return new_title

title_fix(df)

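The entities in the line entities = df['entity'] is not the same variable as the entities in the line def title_fix(entities):. The second entities is the parameter of the function title_fix, and it exists only within the function. It takes on whatever argument you pass into your call to title_fix, which here is the whole DataFrame df rather than the column. Iterating over a DataFrame yields its column labels, so the loop never sees the values in the entity column. The call also discards the function's return value, so df would stay unchanged even if the loop were correct.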
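A small sketch of that scoping rule, using plain lists in place of the DataFrame. The function name show_argument and the sample values are made up for illustration.

```python
# Module-level name, analogous to entities = df['entity'] in the question.
entities = ['china', 'al Group1']

def show_argument(entities):
    # Inside the function, `entities` is the parameter: it refers to whatever
    # the caller passed in, not to the module-level list above.
    return entities

# Passing a different object shows that the module-level list plays no role.
passed_in = show_argument(['something else'])
print(passed_in)   # ['something else']
print(entities)    # ['china', 'al Group1']
```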

Try this instead of your function:

# A list of entity names to leave alone (must exactly match character-for-character)
myString = ['al Group1', 'al Group2']
# Apply title case to every entity NOT in myString
df['entity'] = df['entity'].apply(lambda x: x if x in myString else x.title())
# Print the modified DataFrame
df

Note that this solution requires each string in myString to match the target string in df['entity'] exactly. An entry that differs in any character (case, spacing) is not recognized as protected and gets title-cased like everything else.
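If the hand-entered data may contain stray whitespace or inconsistent capitalization, one option is to normalize each value before comparing it. A minimal sketch, assuming that trimming and case-folding is the right notion of "same entity" for your data; the helper name fix_title and the protected dictionary are hypothetical:

```python
# Names to leave untouched, as in the answer above.
myString = ['al Group1', 'al Group2']

# Map a normalized form of each protected name to its canonical spelling.
protected = {name.strip().casefold(): name for name in myString}

def fix_title(value):
    """Return the canonical protected name, or the title-cased value."""
    key = value.strip().casefold()
    if key in protected:
        return protected[key]
    return value.strip().title()

print(fix_title(' AL group1 '))  # al Group1
print(fix_title('china'))        # China
```

With pandas, this could be applied as df['entity'] = df['entity'].apply(fix_title).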

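Your code had several bugs: the list is created as new_titles but later referred to as new_title, the first branch appends the whole myString list instead of the current entity, the return statement is indented inside the loop so the function exits after the first item, and the function is called with df instead of the column, with its result never assigned back. Fixed code: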

myString = ['al Group1', 'al Group2']
entities = df['entity']

def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(entity)
        else:
            new_titles.append(entity.title())
    return new_titles

df['entity'] = title_fix(entities)

However, what you want to achieve can be done in a one-liner. I came up with 3 solutions. I don't know pandas that well and I have no idea about the performance differences between these solutions, but here they are.

ignored makes a little bit more sense than myString so I'll use it.

ignored = ['al Group1', 'al Group2']

First solution:

df['entity'] = df['entity'].apply(lambda x: x.title() if x not in ignored else x)

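Second (this is chained assignment: pandas may emit a SettingWithCopyWarning, and with Copy-on-Write enabled it does not modify df, so prefer the third form):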

df.entity[~df.entity.isin(ignored)] = df.entity.str.title()

Third:

df.loc[~df.entity.isin(ignored), 'entity'] = df.entity.str.title()
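For comparison, here is a self-contained sketch of the same idea using Series.where, which keeps values where the condition is true and substitutes the alternative elsewhere. The sample DataFrame is invented for illustration and is not the asker's data.

```python
import pandas as pd

# Invented sample data standing in for the real DataFrame.
df = pd.DataFrame({'entity': ['china', 'al Group1', 'united states', 'al Group2']})
ignored = ['al Group1', 'al Group2']

# Keep ignored entries as they are; title-case everything else.
keep = df['entity'].isin(ignored)
df['entity'] = df['entity'].where(keep, df['entity'].str.title())

print(df['entity'].tolist())
# ['China', 'al Group1', 'United States', 'al Group2']
```

Like the .loc version, this avoids a Python-level lambda and does not rely on chained assignment.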


 