I have a DataFrame in which one column ('entity') contains names of countries and non-state entities. I need to clean the column because the string values (entered manually) are all lower-case (china instead of China). I can't simply apply .title() to the whole column, since there are some string values that I want left untouched (e.g., al Something should not be turned into Al Something).
I'm having trouble writing a function to do this and could use some guidance from the community. In the past I've used dictionaries to map incorrect strings to correct ones, and I can still fall back on that approach, but I thought a function might be more straightforward and efficient, and I wanted to challenge myself. However, no change occurs in the entity column when I execute the function. Thanks in advance!
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(myString)
        else:
            new_title.append(entity.title())
    return new_title
title_fix(df)
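The entities in the line entities = df['entity'] is not the same variable as the entities in the line def title_fix(entities):. This second entities variable is the parameter of the function title_fix, and it exists only within the function. It takes on whatever argument you pass into your call to title_fix, which is df. Because iterating over a DataFrame yields its column labels, the loop never sees the values in the entity column, and the return value of the call is discarded anyway.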
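To make the scoping point concrete, here is a minimal sketch. The sample DataFrame, its year column, and the helper show_items are made up for illustration and are not from the question. It shows that the function only sees what is passed in at the call site, and that passing the whole DataFrame yields column labels rather than entity values.

```python
import pandas as pd

# Hypothetical sample data; the 'year' column is only here to show column labels.
df = pd.DataFrame({"entity": ["china", "al Group1"], "year": [2001, 2002]})

def show_items(entities):
    # 'entities' is a local name: it refers to whatever the caller passes in,
    # regardless of any module-level variable with the same name.
    return [item for item in entities]

# Passing the DataFrame iterates over its column labels.
from_frame = show_items(df)

# Passing the column (a Series) iterates over the actual values.
from_column = show_items(df["entity"])

print(from_frame)
print(from_column)
```

So the call would need to be title_fix(df['entity']), with the result assigned back to the column, for the original function to have any effect.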
Try this instead of your function:
# A list of entity names to leave alone (must exactly match character-for-character)
myString = ['al Group1', 'al Group2']
# Apply title case to every entity NOT in myString
df['entity'] = df['entity'].apply(lambda x: x if x in myString else x.title())
# Print the modified DataFrame
df
Note that this solution requires each string in myString to exactly match the target string in df['entity']; otherwise the target string is not recognized as an exception and is title-cased like every other value.
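The exact-match caveat can be seen in a small sketch. The sample values, the canonical dictionary, and the fix_entity helper are assumptions for illustration; the normalized lookup (strip whitespace, compare lower-case) is one possible way to loosen the match, not something the answer above prescribes.

```python
import pandas as pd

# Hypothetical sample data: the last value differs from the exception list
# only by case and surrounding whitespace.
df = pd.DataFrame({"entity": ["china", "al Group1", " al group2 "]})
myString = ["al Group1", "al Group2"]

# Exact matching: ' al group2 ' is not in myString, so it gets title-cased.
exact = df["entity"].apply(lambda x: x if x in myString else x.title())

# Looser matching (an assumption): map a normalized key to the canonical spelling.
canonical = {name.strip().lower(): name for name in myString}

def fix_entity(value):
    key = value.strip().lower()
    # Return the canonical spelling for exceptions, title case otherwise.
    # Note that this also strips surrounding whitespace from every value.
    return canonical.get(key, value.strip().title())

normalized = df["entity"].apply(fix_entity)

print(exact.tolist())
print(normalized.tolist())
```

Whether stripping and lower-casing is appropriate depends on how messy the manual data entry is, so treat the normalization rule as a starting point.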
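Your code had several bugs: the list is named new_titles but you append to and return new_title, you append the whole myString list instead of the matching entity, you pass df rather than the column to the function, and you never assign the result back to the column. There were also indentation problems. Fixed code: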
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(entity)
        else:
            new_titles.append(entity.title())
    return new_titles
df['entity'] = title_fix(entities)
However, what you want to achieve can be done in a one-liner. I came up with 3 solutions. I don't know pandas that well and I have no idea about the performance differences between these solutions, but here they are.
ignored makes a little bit more sense as a name than myString, so I'll use it.
ignored = ['al Group1', 'al Group2']
First solution:
df['entity'] = df['entity'].apply(lambda x: x.title() if x not in ignored else x)
Second:
df.entity[~df.entity.isin(ignored)] = df.entity.str.title()
Third:
df.loc[~df.entity.isin(ignored), 'entity'] = df.entity.str.title()
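As a quick sanity check, the sketch below runs the apply-based and .loc-based variants on the same made-up data and compares the results. The sample values and the make_frame helper are assumptions for illustration. The second form is left out because it assigns through chained indexing (df.entity[...] = ...), which the pandas documentation discourages since the assignment may land on a temporary copy rather than on df.

```python
import pandas as pd

ignored = ["al Group1", "al Group2"]

def make_frame():
    # Hypothetical sample data, rebuilt for each variant so they start equal.
    return pd.DataFrame(
        {"entity": ["china", "al Group1", "united nations", "al Group2"]}
    )

# Variant 1: element-wise apply with a conditional expression.
df_apply = make_frame()
df_apply["entity"] = df_apply["entity"].apply(
    lambda x: x if x in ignored else x.title()
)

# Variant 3: boolean mask with .loc, title-casing only the selected rows.
df_loc = make_frame()
mask = ~df_loc["entity"].isin(ignored)
df_loc.loc[mask, "entity"] = df_loc.loc[mask, "entity"].str.title()

# Both variants should produce the same column.
same = df_apply["entity"].tolist() == df_loc["entity"].tolist()
print(df_apply["entity"].tolist(), same)
```

No timing claims are made here; if performance matters, measure both variants on your real data.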