In Python, I have a pandas DataFrame with a column of place names, and I want to trim each string so that it matches the format used in a larger list, with the goal of merging the two.
Ultimately, I want to make this list match the format of the other dataframe so that when I merge, I'm only merging rows where the "stop_name" column matches.
For example, out of the list below, I want to remove " STATION", so that "BOONTON STATION" becomes just "BOONTON".
However, I also want "BUTLER STATION (NEW JERSEY)" to become just "BUTLER", removing " STATION (NEW JERSEY)".
Lastly, for a 2-word station name I want to keep the second word, so that "MORRIS PLAINS STATION" becomes just "MORRIS PLAINS".
Basically, I want to remove everything from the space before the word "STATION" through the end of the string, on every row in the "stop_name" column.
I've tried various splits and replacements of strings and I'm either getting errors, or it's not making the replacement on every row.
Any direction to a viable solution would be appreciated.
stop_name
0 BOONTON STATION
1 BUTLER STATION (NEW JERSEY)
2 CONVENT STATION (NJ TRANSIT)
3 DOVER STATION (NJ TRANSIT)
4 LAKE HOPATCONG STATION
5 MADISON STATION (NJ TRANSIT)
6 MILLINGTON STATION
7 MORRIS PLAINS STATION
8 MORRISTOWN STATION
9 MOUNT ARLINGTON STATION
10 MOUNT TABOR STATION
12 POMPTON PLAINS STATION
13 TOWACO STATION
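It seems you just want to replace the pattern " STATION.*" (note the leading space) with an empty string. Pass regex=True explicitly, because from pandas 2.0 onward str.replace treats the pattern as a literal string by default:
df.stop_name.str.replace(r' STATION.*', '', regex=True)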
0 BOONTON
1 BUTLER
2 CONVENT
3 DOVER
4 LAKE HOPATCONG
5 MADISON
6 MILLINGTON
7 MORRIS PLAINS
8 MORRISTOWN
9 MOUNT ARLINGTON
10 MOUNT TABOR
12 POMPTON PLAINS
13 TOWACO
Name: stop_name, dtype: object
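Since the stated goal is a merge on "stop_name", here is a minimal sketch of assigning the cleaned values back and then merging. The second frame `other` and its `zone` column are hypothetical stand-ins for the larger list mentioned in the question, not something taken from it.

```python
import pandas as pd

# Small sample of the frame from the question.
df = pd.DataFrame({"stop_name": ["BOONTON STATION",
                                 "BUTLER STATION (NEW JERSEY)",
                                 "MORRIS PLAINS STATION"]})

# Hypothetical second frame that already uses the short names.
other = pd.DataFrame({"stop_name": ["BOONTON", "BUTLER", "MORRIS PLAINS"],
                      "zone": [1, 2, 3]})

# str.replace returns a new Series, so assign it back to keep the change.
# regex=True is needed on pandas >= 2.0, where the pattern is literal by default.
df["stop_name"] = df["stop_name"].str.replace(r" STATION.*", "", regex=True)

# An inner merge keeps only rows whose stop_name appears in both frames.
merged = df.merge(other, on="stop_name", how="inner")
print(merged)
```

If some rows still fail to match after cleaning, stray whitespace or differing case in either frame is a likely cause worth checking.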
A regular expression with str.extract() is straightforward.
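import io
import pandas as pd

# Two or more spaces separate the index from the name, so single spaces inside names are kept.
df = pd.read_csv(io.StringIO("""stop_name
0  BOONTON STATION
1  BUTLER STATION (NEW JERSEY)
2  CONVENT STATION (NJ TRANSIT)
3  DOVER STATION (NJ TRANSIT)
4  LAKE HOPATCONG STATION
5  MADISON STATION (NJ TRANSIT)
6  MILLINGTON STATION
7  MORRIS PLAINS STATION
8  MORRISTOWN STATION
9  MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION"""), sep=r"\s\s+", engine="python")
# Keep everything before " STATION"; expand=False returns a Series rather than a DataFrame.
df.stop_name = df.stop_name.str.extract(r"(^.*) STATION.*$", expand=False)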
| | stop_name |
|---|---|
| 0 | BOONTON |
| 1 | BUTLER |
| 2 | CONVENT |
| 3 | DOVER |
| 4 | LAKE HOPATCONG |
| 5 | MADISON |
| 6 | MILLINGTON |
| 7 | MORRIS PLAINS |
| 8 | MORRISTOWN |
| 9 | MOUNT ARLINGTON |
| 10 | MOUNT TABOR |
| 12 | POMPTON PLAINS |
| 13 | TOWACO |
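One difference from the replace approach is worth knowing: str.extract yields NaN for any row where the pattern does not match. The sketch below assumes a hypothetical row ("NEWARK PENN") that has no " STATION" in it, which is not part of the question's data, and falls back to the original value for such rows.

```python
import pandas as pd

# "NEWARK PENN" is a hypothetical row added to show the non-matching case.
names = pd.Series(["BOONTON STATION", "DOVER STATION (NJ TRANSIT)", "NEWARK PENN"])

# With a single capture group, expand=False returns a Series.
extracted = names.str.extract(r"^(.*?) STATION", expand=False)

# Rows without " STATION" come back as NaN, so fill them from the original values.
cleaned = extracted.fillna(names)
print(cleaned)
```

With str.replace this fallback is unnecessary, because a row that does not match is simply left unchanged.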
Alternative without regular expression:
>>> df["stop_name"].str.split("STATION").str[0].str.strip()
0 BOONTON
1 BUTLER
2 CONVENT
3 DOVER
4 LAKE HOPATCONG
5 MADISON
6 MILLINGTON
7 MORRIS PLAINS
8 MORRISTOWN
9 MOUNT ARLINGTON
10 MOUNT TABOR
12 POMPTON PLAINS
13 TOWACO
Name: stop_name, dtype: object
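As a quick sanity check, the three approaches in this thread can be compared on a few rows from the question. This is only a sketch on sample data; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series(["BOONTON STATION",
                   "BUTLER STATION (NEW JERSEY)",
                   "MORRIS PLAINS STATION"])

# Regex replace: drop " STATION" and everything after it.
by_replace = names.str.replace(r" STATION.*", "", regex=True)
# Regex extract: capture everything before " STATION".
by_extract = names.str.extract(r"^(.*?) STATION", expand=False)
# Plain split: take the part before "STATION" and strip the trailing space.
by_split = names.str.split("STATION").str[0].str.strip()

# All three should agree on this sample.
print(by_replace.tolist() == by_extract.tolist() == by_split.tolist())
```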