
Split text in a column into three columns

This question is a follow-up to Pietro's fantastic answer on how to split a column into multiple columns. My goal is to take a column from an existing data frame, split it on whitespace, and then place the resulting pieces into three columns: City, State, and Zip.

The issue with this split is that the number of space-separated words varies between rows, because city names can contain more than one word. Sometimes the data appears like "Fort Lee NJ 07024". Other times, it appears like "NY NY 10000". I'm not sure if there's an easy fix.

df['City, State, Zip'].str.split()
# Returns a list of variable length for each row.
# I need to take the first three or four values, and add them to columns: City/State/Zip

Assuming that state and zip are always present and contain valid data, one way to solve this problem is to first split each string on whitespace. The state and zip are then simply the second-to-last and last elements of each resulting list, respectively. I've used a list comprehension to extract them from city_state_zip. To extract the city, I've used a nested list comprehension together with join. Because the last two elements are the state and zip, the length of the list minus two tells you how many elements make up the city name. You then just need to join them with a space.

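import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024', 
                                      'NY NY 10000', 
                                      'Carmel by the Sea CA 93922']})

# Split each string on whitespace into a list of words.
city_state_zip = df.city_state_zip.apply(lambda x: x.split())
# City: every word except the last two, joined back together with spaces.
df['city'] = [" ".join([x[c] for c in range(len(x) - 2)]) for x in city_state_zip]
# State and zip: the second-to-last and last words.
df['state'] = [x[-2] for x in city_state_zip]
df['zip'] = [x[-1] for x in city_state_zip]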
>>> df
               city_state_zip               city state    zip
0           Fort Lee NJ 07024           Fort Lee    NJ  07024
1                 NY NY 10000                 NY    NY  10000
2  Carmel by the Sea CA 93922  Carmel by the Sea    CA  93922

EDIT: As suggested by DSM, it looks like the last two words are the state and zip code, in which case you can do
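A minimal sketch of splitting off the last two words in one step, assuming pandas' Series.str.rsplit with n=2 and expand=True. This is an illustration of the idea, not necessarily the exact snippet DSM proposed; the names parts, city, state and zip are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024',
                                      'NY NY 10000',
                                      'Carmel by the Sea CA 93922']})

# Split from the right at most twice, so everything before the
# last two words stays together as the city name.
parts = df['city_state_zip'].str.rsplit(n=2, expand=True)
parts.columns = ['city', 'state', 'zip']  # illustrative column names

# Attach the three new columns to the original frame.
df = df.join(parts)
```

Like the list-comprehension version, this assumes every row ends with a state and a zip; a row with fewer than three words would leave some of the new columns empty.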

df = pd.DataFrame({'city_state_zip': ['Fort Lee NJ 07024', 
                                      'NY NY 10000', 
                                      'Carmel by the Sea CA 93922']})

In [50]: regex = r'(?P<City>[a-zA-Z ]*) (?P<State>[A-Z]{2}) (?P<Zip>[\d-]*)'
         df.city_state_zip.str.extract(regex)
Out[50]:
                City State    Zip
0           Fort Lee    NJ  07024
1                 NY    NY  10000
2  Carmel by the Sea    CA  93922

This method uses regex extraction with multiple named groups, one each for City, State and Zip. The result of the extract method is a dataframe with 3 columns, as shown. The syntax for a group is to surround its regex with parentheses. To name a group, insert ?P<group name> inside the parentheses, before the group's regex. This solution assumes that city names contain only upper- and lowercase letters and spaces, and that state abbreviations contain exactly 2 capital letters, but you can adjust it if this isn't the case. Note that the spaces between the groups in the regex are important here, as they represent the spaces between the city, state and zip.
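As a sketch of one such adjustment, the pattern below loosens the city group so that names with periods, apostrophes or hyphens also match, and tightens the zip group to five digits with an optional four-digit suffix. The sample rows and the exact pattern are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical rows whose city names contain punctuation.
df = pd.DataFrame({'city_state_zip': ["St. Louis MO 63101",
                                      "Coeur d'Alene ID 83814-1234",
                                      "Winston-Salem NC 27101"]})

# City: any characters, matched lazily so it stops before the state.
# State: exactly two capital letters.
# Zip: five digits, optionally followed by a hyphen and four digits.
regex = (r"^(?P<City>.+?)\s+"
         r"(?P<State>[A-Z]{2})\s+"
         r"(?P<Zip>\d{5}(?:-\d{4})?)$")

parts = df['city_state_zip'].str.extract(regex)
```

Rows that do not match the pattern come back as NaN in all three columns, which makes malformed addresses easy to spot.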


 