简体   繁体   中英

Extract string from column following a specific pattern

Please forgive my panda newbie question, but I have a column of US towns and states, such as the truncated version shown below (For some strange reason, the name of the column is called 'Alabama[edit]' which is associated with the first 0-7 town values in the column):

0                          Auburn (Auburn University)[1]
1                 Florence (University of North Alabama)
2        Jacksonville (Jacksonville State University)[2]
3             Livingston (University of West Alabama)[2]
4               Montevallo (University of Montevallo)[2]
5                              Troy (Troy University)[2]
6      Tuscaloosa (University of Alabama, Stillman Co...
7                      Tuskegee (Tuskegee University)[5]
8                                           Alaska[edit]
9          Fairbanks (University of Alaska Fairbanks)[2]
10                                         Arizona[edit]
11            Flagstaff (Northern Arizona University)[6]
12                      Tempe (Arizona State University)
13                        Tucson (University of Arizona)
14                                        Arkansas[edit]
15     Arkadelphia (Henderson State University, Ouach...
16     Conway (Central Baptist College, Hendrix Colle...
17              Fayetteville (University of Arkansas)[7]
18              Jonesboro (Arkansas State University)[8]
19            Magnolia (Southern Arkansas University)[2]
20     Monticello (University of Arkansas at Monticel...
21            Russellville (Arkansas Tech University)[2]
22                        Searcy (Harding University)[5]
23                                      California[edit]

The towns that are in each state are below each state name, eg Fairbanks (column value 9) is a town in the state of Alaska.

What I want to do is to split up the town names based on the state names so that I have two columns 'State' and 'RegionName' where each state name is associated with each town name, like so:

                            RegionName                       State
0                          Auburn (Auburn University)[1]    Alabama
1                 Florence (University of North Alabama)    Alabama
2        Jacksonville (Jacksonville State University)[2]    Alabama
3             Livingston (University of West Alabama)[2]    Alabama
4               Montevallo (University of Montevallo)[2]    Alabama
5                              Troy (Troy University)[2]    Alabama
6      Tuscaloosa (University of Alabama, Stillman Co...    Alabama
7                      Tuskegee (Tuskegee University)[5]    Alabama

8         Fairbanks (University of Alaska Fairbanks)[2]     Alaska

9         Flagstaff (Northern Arizona University)[6]        Arizona
10                      Tempe (Arizona State University)    Arizona
11                        Tucson (University of Arizona)    Arizona                                              

12        Arkadelphia (Henderson State University, Ouach... Arkansas                                           

. . .etc.

I know that each state name is followed by a string '[edit]', which I assume I can use to do the split and assignment of the town names. But I don't know how to do this.

Also, I know that there's a lot of other data cleaning I need to do, such as removing the strings within parentheses and within the brackets '[]'. That can be done later...the important part is splitting up the states and towns and assigning each town to its proper US Any advice would be most appreciated.

Without much context or access to your data, I'd suggest something along these lines. First, modify the code that reads your data:

df = pd.read_csv(..., header=None, names=['RegionName']) 
# add header=False so as to read the first row as data

Now, extract the state name using str.extract , this should only extract names as long as they are succeeded by the substring "[edit]". You can then forward fill all NaN values using ffill .

df['State'] = df['RegionName'].str.extract(
    r'(?P<State>.*)(?=\s*\[edit\])'
).ffill()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM