
Python - Splitting and replacing parts of a list with multiple replacements in a pandas dataframe

I have a pandas dataframe with a column of place names, and I want to reduce each string to match the format used in a larger dataframe, with the goal of merging the two.

Ultimately, I want to make this list match the format of the other dataframe so that when I merge, I'm only merging rows where the "stop_name" column matches.

For example, out of the list below, I want to remove " STATION", so that "BOONTON STATION" becomes just "BOONTON".

However, I also want "BUTLER STATION (NEW JERSEY)" to become just "BUTLER", removing " STATION (NEW JERSEY)".

Lastly, for a 2-word station name I want to keep the second word, so that "MORRIS PLAINS STATION" becomes just "MORRIS PLAINS".

Basically, on every row of the "stop_name" column, I want to remove the space before the word "STATION", the word itself, and everything after it.

I've tried various splits and replacements of strings and I'm either getting errors, or it's not making the replacement on every row.

Any direction to a viable solution would be appreciated.

stop_name
0   BOONTON STATION
1   BUTLER STATION (NEW JERSEY)
2   CONVENT STATION (NJ TRANSIT)
3   DOVER STATION (NJ TRANSIT)
4   LAKE HOPATCONG STATION
5   MADISON STATION (NJ TRANSIT)
6   MILLINGTON STATION
7   MORRIS PLAINS STATION
8   MORRISTOWN STATION
9   MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION

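It seems you just want to replace the pattern " STATION.*" with an empty string. Pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default:

df.stop_name.str.replace(r' STATION.*', '', regex=True)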

0             BOONTON
1              BUTLER
2             CONVENT
3               DOVER
4      LAKE HOPATCONG
5             MADISON
6          MILLINGTON
7       MORRIS PLAINS
8          MORRISTOWN
9     MOUNT ARLINGTON
10        MOUNT TABOR
12     POMPTON PLAINS
13             TOWACO
Name: stop_name, dtype: object
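
Since the end goal is a merge on "stop_name", here is a minimal sketch of the whole flow. The second frame `other`, its `zone` column, and the "DENVILLE" row are invented for illustration; only `stop_name` and the three station names come from the question.

```python
import pandas as pd

# A subset of the station names given in the question.
df = pd.DataFrame({"stop_name": [
    "BOONTON STATION",
    "BUTLER STATION (NEW JERSEY)",
    "MORRIS PLAINS STATION",
]})

# Hypothetical second frame; the "zone" column is made up for this sketch.
other = pd.DataFrame({
    "stop_name": ["BOONTON", "BUTLER", "MORRIS PLAINS", "DENVILLE"],
    "zone": [1, 2, 3, 4],
})

# Drop " STATION" and everything after it on every row.
df["stop_name"] = df["stop_name"].str.replace(r" STATION.*", "", regex=True)

# An inner merge keeps only rows whose stop_name appears in both frames.
merged = df.merge(other, on="stop_name", how="inner")
print(merged)
```

With how="inner", the made-up "DENVILLE" row is dropped because it has no match in `df`.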

Using a regular expression with str.extract() is straightforward.

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""stop_name
0   BOONTON STATION
1   BUTLER STATION (NEW JERSEY)
2   CONVENT STATION (NJ TRANSIT)
3   DOVER STATION (NJ TRANSIT)
4   LAKE HOPATCONG STATION
5   MADISON STATION (NJ TRANSIT)
6   MILLINGTON STATION
7   MORRIS PLAINS STATION
8   MORRISTOWN STATION
9   MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION"""), sep=r"\s\s+", engine="python")

# expand=False returns a Series rather than a one-column DataFrame
df.stop_name = df.stop_name.str.extract(r"(^.*) STATION.*$", expand=False)


stop_name
0 BOONTON
1 BUTLER
2 CONVENT
3 DOVER
4 LAKE HOPATCONG
5 MADISON
6 MILLINGTON
7 MORRIS PLAINS
8 MORRISTOWN
9 MOUNT ARLINGTON
10 MOUNT TABOR
12 POMPTON PLAINS
13 TOWACO

An alternative without regular expressions:

>>> df["stop_name"].str.split("STATION").str[0].str.strip()
0             BOONTON
1              BUTLER
2             CONVENT
3               DOVER
4      LAKE HOPATCONG
5             MADISON
6          MILLINGTON
7       MORRIS PLAINS
8          MORRISTOWN
9     MOUNT ARLINGTON
10        MOUNT TABOR
12     POMPTON PLAINS
13             TOWACO
Name: stop_name, dtype: object
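
The three approaches agree on the sample data, but they can differ on a row that does not contain "STATION" at all. The sketch below uses an invented value, "NEWARK PENN", to show this: replace and split leave such a row unchanged, while extract yields NaN because the pattern does not match.

```python
import pandas as pd

# The last value is made up; it stands in for a row without "STATION".
s = pd.Series(["BOONTON STATION", "BUTLER STATION (NEW JERSEY)", "NEWARK PENN"])

# Approach 1: regex replace; non-matching rows pass through unchanged.
replaced = s.str.replace(r" STATION.*", "", regex=True)

# Approach 2: regex extract; non-matching rows become NaN.
extracted = s.str.extract(r"(^.*) STATION.*$", expand=False)

# Approach 3: plain split; non-matching rows pass through unchanged.
split = s.str.split("STATION").str[0].str.strip()

print(pd.DataFrame({"replace": replaced, "extract": extracted, "split": split}))
```

If some rows in the real data lack "STATION", the replace or split form may be the safer choice before a merge, since the extract form would turn those names into missing values.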
