This question is a follow-up to Pietro's fantastic answer on how to split a column into multiple columns. My goal is to take a column from an existing data frame, split it on spaces, and then take the first three or four split values and place each in a particular column, ignoring the remainder.
The issue with this split is that the number of whitespace-separated words varies between rows. Sometimes the data appears like "Fort Lee NJ 07024". Other times, it appears like "NY NY 10000". I'm not sure if there's an easy fix.
df['City, State, Zip'].str.split()
# Returns a variable-length list for each row.
# I need to take the first three or four values, and add them to columns: City/State/Zip
Assuming that state and zip are always present and contain valid data, one method to solve this problem is to first split your string. The state and zip are simply the second-to-last and last elements of each split list, respectively. I've used a list comprehension to extract them from city_state_zip. To extract the city, I've used a nested list comprehension together with join. The last two elements are the state and zip, so the length of the list minus two tells you how many elements are contained in the city name. You then just need to join them with a space.
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922']})
# Split each string on whitespace into a list of words.
city_state_zip = df.city_state_zip.apply(lambda x: x.split())
# All words except the last two make up the city name.
df['city'] = [" ".join([x[c] for c in range(len(x) - 2)]) for x in city_state_zip]
df['state'] = [x[-2] for x in city_state_zip]  # second-to-last word
df['zip'] = [x[-1] for x in city_state_zip]    # last word
>>> df
               city_state_zip               city state    zip
0           Fort Lee NJ 07024           Fort Lee    NJ  07024
1                 NY NY 10000                 NY    NY  10000
2  Carmel by the Sea CA 93922  Carmel by the Sea    CA  93922
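Since the approach above only relies on the last two words being the state and zip, the same idea can probably be written more compactly by splitting from the right at most twice. This is a minimal sketch, not part of the original answer; the names parts and result are made up for illustration, and it assumes every row contains at least three whitespace-separated words (shorter rows would be padded with None and end up in the wrong columns).

```python
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922']})

# Split on whitespace starting from the right, at most two times, so that
# everything left of the last two words stays together as the city name.
parts = df['city_state_zip'].str.rsplit(n=2, expand=True)
parts.columns = ['city', 'state', 'zip']

# Attach the three new columns to the original frame.
result = df.join(parts)
print(result)
```

Because the zip stays a string, leading zeros such as in "07024" are preserved.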
EDIT: As suggested by DSM, it looks like the last two words are the state and zip code, in which case you can do
df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
'NY NY 10000',
'Carmel by the Sea CA 93922']})
In [50]: regex = r'(?P<City>[a-zA-Z ]*) (?P<State>[A-Z]{2}) (?P<Zip>[\d-]*)'
df.city_state_zip.str.extract(regex)
Out[50]:
                City State    Zip
0           Fort Lee    NJ  07024
1                 NY    NY  10000
2  Carmel by the Sea    CA  93922
This method uses regex extraction with multiple named groups, one each for City, State and Zip. The result of the extract method is a dataframe with 3 columns, as shown. The syntax for a group is to surround its regex with parentheses. To name a group, insert ?P<group name> inside the parentheses, before the group's regex. This solution assumes that city names contain only upper- and lowercase letters and spaces, and that state abbreviations contain exactly 2 capital letters, but you can adjust it if this isn't the case. Note that the spaces between the groups in the regex are important here, as they represent the spaces between the city, state and zip.
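As an illustration of adjusting the pattern, here is a hedged sketch of a looser variant. It is not from the original answer: the sample rows, the name loose_regex, and the assumption that zips are either 5 digits or ZIP+4 are all mine. The city group accepts any characters (so names like "St. Louis" work), the pattern is anchored at both ends, and rows that don't match come back as NaN.

```python
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'St. Louis MO 63101-1234',
                                      'Carmel by the Sea CA 93922',
                                      'unknown']})

# City: anything, matched lazily so it stops before the state and zip.
# State: exactly two capital letters.
# Zip: five digits, optionally followed by a hyphen and four more digits.
loose_regex = (r'^(?P<City>.+?)\s+'
               r'(?P<State>[A-Z]{2})\s+'
               r'(?P<Zip>\d{5}(?:-\d{4})?)$')

extracted = df['city_state_zip'].str.extract(loose_regex)

# Rows that did not match the pattern have NaN in every column.
unmatched = extracted['City'].isna()
print(extracted)
print(df.loc[unmatched, 'city_state_zip'])
```

Checking for NaN rows afterwards is a cheap way to find inputs the pattern does not cover before relying on the extracted columns.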