Can I get a solution for this? Say I have this:
df['Location']
When I run it, I get this:
0 New York, NY
1 Chantilly, VA
2 Boston, MA
3 Newton, MA
4 New York, NY
...
667 Fort Lee, NJ
668 San Francisco, CA
669 Irwindale, CA
670 San Francisco, CA
671 New York, NY
Name: Location, Length: 659, dtype: object
I want to simplify this: if a value contains New York, NY, I want it to become NY; if it contains Boston, MA, I want it to become MA; and so on.
So I wrote this code:
def clean_location_1(x):
    if 'CA':
        return 'CA'
    elif 'NY':
        return 'NY'
    elif 'DC':
        return 'DC'
    elif 'MA':
        return 'MA'
    elif 'IL':
        return 'IL'
    elif 'VA':
        return 'VA'
    else:
        return 'others'
df['Location'] = df['Location'].apply(clean_location_1)
But when I run my script, every Location becomes CA.
How can I solve this?
One possible solution that keeps your approach is the following.
import pandas as pd
data = pd.DataFrame([{'location': 'New York, NY'},
                     {'location': 'Chantilly, VA'},
                     {'location': 'Boston, MA'},
                     {'location': 'Newton, MA'},
                     {'location': 'San Francisco, CA'}])
def clean_location_1(x):
    if 'CA' in x:
        return 'CA'
    elif 'NY' in x:
        return 'NY'
    elif 'DC' in x:
        return 'DC'
    elif 'MA' in x:
        return 'MA'
    elif 'IL' in x:
        return 'IL'
    elif 'VA' in x:
        return 'VA'
    else:
        return 'others'
data['location'].apply(clean_location_1)
The problem was that the conditions in your if/else block were incorrect: they never tested x.
Another way of doing this might be:
list_states = ['CA', 'NY', 'DC', 'MA', 'IL', 'VA']
data['location'].apply(lambda x: x.split(' ')[-1] if x.split(' ')[-1] in list_states else 'others')
Then you won't need a huge if/else block.
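The two approaches above are not quite equivalent: 'CA' in x matches the code anywhere in the text, while the split version looks only at the last token. A minimal sketch of the difference, where by_substring, by_last_token and the input 'CARY, NC' are made up for illustration and are not from the question's data:

```python
def by_substring(x):
    # Checks whether the code appears anywhere in the text
    return 'CA' if 'CA' in x else 'others'

def by_last_token(x, states=('CA', 'NY', 'DC', 'MA', 'IL', 'VA')):
    # Looks only at the part after the last space
    code = x.split(' ')[-1]
    return code if code in states else 'others'

# 'CARY, NC' is a made-up, upper-case input chosen to show the difference
sample = 'CARY, NC'
print(by_substring(sample))   # CA
print(by_last_token(sample))  # others
```

With mixed-case city names like the ones shown in the question this is unlikely to matter, but the last-token check is the stricter of the two.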
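When you write if 'CA', it doesn't mean much; you have to check the value.
This should do it using pd.Series.str.contains. Note that .str only exists on a Series, so the function takes the whole column instead of being passed to .apply:
def clean_location_1(s):
    # s is the whole column (a Series), not a single value
    states = ['CA', 'NY', 'DC', 'MA', 'IL', 'VA']
    result = pd.Series('others', index=s.index)
    # Go through the states in reverse so earlier ones win, like an if/elif chain
    for state in reversed(states):
        result[s.str.contains(state, regex=False, na=False)] = state
    return result
df['Location'] = clean_location_1(df['Location'])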
The problem is simple: you are not comparing the string with x. 'CA' on its own always evaluates as true, because non-empty strings are truthy. That is why everything changes to CA.
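The truthiness point can be checked directly. This is a small sketch; the variable names and the sample value are chosen here for illustration:

```python
# Hypothetical sample value, similar to the rows in the question
x = 'New York, NY'

# A non-empty string literal is truthy regardless of x
always_true = bool('CA')

# Membership tests actually look inside x
has_ca = 'CA' in x
has_ny = 'NY' in x

print(always_true, has_ca, has_ny)  # True False True
```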
Writing if "<str>": always evaluates to True for a non-empty string, which means your code will always return CA. Instead, check whether <word> is in x:
def clean_location_1(x):
    if 'CA' in x:
        return 'CA'
    elif 'NY' in x:
        return 'NY'
    elif 'DC' in x:
        return 'DC'
    elif 'MA' in x:
        return 'MA'
    elif 'IL' in x:
        return 'IL'
    elif 'VA' in x:
        return 'VA'
    else:
        return 'others'
df['Location'] = df['Location'].apply(clean_location_1)
Or you can try this, which is easy, clean and simple:
check = ["CA", "NY", "DC", "MA", "IL", "VA"]
def clean_location_1(x):
    y = x.rsplit(", ", 1)[1]
    if y in check:
        return y
    else:
        return "others"
df['Location'] = df['Location'].apply(clean_location_1)
Here we create a list, check, of the state abbreviations you used in your if-else statements, then test whether the part of x after the last ", " is in check.
Or a one-liner, the same as the second approach but in one line:
check=["CA","NY","DC","MA","IL","VA"]
df['Location'] = df['Location'].apply(lambda x: x.rsplit(", ",1)[1] if x.rsplit(", ",1)[1] in check else "others")
You can do:
states = ['CA', 'NY', 'DC', 'MA', 'IL', 'VA']
df['State'] = df['Location'].str.split(', ', expand=True)[1] \
                            .rename('State').to_frame().query('State in @states')
df['State'] = df['State'].fillna('other')
>>> df
            Location  State
0       New York, NY     NY
1      Chantilly, VA     VA
2         Boston, MA     MA
3         Newton, MA     MA
4       New York, NY     NY
5       Fort Lee, NJ  other
6  San Francisco, CA     CA
7      Irwindale, CA     CA
8  San Francisco, CA     CA
9       New York, NY     NY
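A similar result can probably be reached without to_frame and query, using Series.where together with isin. This is only a sketch on a small made-up frame modeled on the rows above; the intermediate name code is an assumption for illustration:

```python
import pandas as pd

# Small made-up frame modeled on the rows in the question
df = pd.DataFrame({'Location': ['New York, NY', 'Chantilly, VA',
                                'Fort Lee, NJ', 'San Francisco, CA']})
states = ['CA', 'NY', 'DC', 'MA', 'IL', 'VA']

# Take the text after the last ', ' in each row
code = df['Location'].str.rsplit(', ', n=1).str[-1]

# Keep the code where it is in the list, otherwise use 'other'
df['State'] = code.where(code.isin(states), 'other')
print(df)
```

Because where replaces the non-matching rows in one step, no separate fillna call is needed.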