Python - Splitting and replacing parts of a list with multiple replacements in a pandas dataframe
In Python, I have a column of place names in a pandas dataframe, and I want to shorten each string to match the format used in a larger list, with the goal of merging the two.
Ultimately, I want this column to match the format of the other dataframe so that when I merge, I only merge rows where the "stop_name" column matches.
For example, in the list below, I want to remove " STATION", so that "BOONTON STATION" becomes just "BOONTON".
However, I also want "BUTLER STATION (NEW JERSEY)" to become just "BUTLER", removing " STATION (NEW JERSEY)".
Lastly, for a two-word station name I want to keep the second word, so that "MORRIS PLAINS STATION" becomes just "MORRIS PLAINS".
Basically, I want to remove everything from the space before the word "STATION" through the end of the string, on every row in the "stop_name" column.
I've tried various string splits and replacements, but I either get errors or the replacement isn't made on every row.
Any direction toward a viable solution would be appreciated.
stop_name
0 BOONTON STATION
1 BUTLER STATION (NEW JERSEY)
2 CONVENT STATION (NJ TRANSIT)
3 DOVER STATION (NJ TRANSIT)
4 LAKE HOPATCONG STATION
5 MADISON STATION (NJ TRANSIT)
6 MILLINGTON STATION
7 MORRIS PLAINS STATION
8 MORRISTOWN STATION
9 MOUNT ARLINGTON STATION
10 MOUNT TABOR STATION
12 POMPTON PLAINS STATION
13 TOWACO STATION
It seems you just want to replace the pattern " STATION.*" with an empty string:
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df.stop_name.str.replace(r' STATION.*', '', regex=True)
0 BOONTON
1 BUTLER
2 CONVENT
3 DOVER
4 LAKE HOPATCONG
5 MADISON
6 MILLINGTON
7 MORRIS PLAINS
8 MORRISTOWN
9 MOUNT ARLINGTON
10 MOUNT TABOR
12 POMPTON PLAINS
13 TOWACO
Name: stop_name, dtype: object
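Note that str.replace returns a new Series, so the result has to be assigned back before merging. Below is a minimal sketch of the whole clean-then-merge flow. The second dataframe (called other here) and its line column are hypothetical stand-ins for the larger dataframe mentioned in the question, and "DENVILLE" is only an illustrative non-matching row.

```python
import pandas as pd

# A sample of the question's data. `other` and its `line` column are
# hypothetical; they stand in for the larger dataframe from the question.
df = pd.DataFrame({"stop_name": ["BOONTON STATION",
                                 "BUTLER STATION (NEW JERSEY)",
                                 "MORRIS PLAINS STATION"]})
other = pd.DataFrame({"stop_name": ["BOONTON", "MORRIS PLAINS", "DENVILLE"],
                      "line": ["A", "B", "C"]})

# Assign the cleaned values back; regex=True makes " STATION.*" a pattern.
df["stop_name"] = df["stop_name"].str.replace(r" STATION.*", "", regex=True)

# An inner merge keeps only rows whose stop_name appears in both dataframes.
merged = df.merge(other, on="stop_name", how="inner")
print(merged)
```

With this sample, BUTLER has no partner in other and DENVILLE has no partner in df, so only BOONTON and MORRIS PLAINS should survive the merge.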
A regular expression with extract() is straightforward.
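import io

import pandas as pd

# Two or more spaces separate the index from the name, so single spaces
# inside a station name are preserved.
df = pd.read_csv(io.StringIO("""stop_name
0  BOONTON STATION
1  BUTLER STATION (NEW JERSEY)
2  CONVENT STATION (NJ TRANSIT)
3  DOVER STATION (NJ TRANSIT)
4  LAKE HOPATCONG STATION
5  MADISON STATION (NJ TRANSIT)
6  MILLINGTON STATION
7  MORRIS PLAINS STATION
8  MORRISTOWN STATION
9  MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION"""), sep=r"\s\s+", engine="python")
# Capture everything before " STATION"; expand=False returns a Series.
df["stop_name"] = df["stop_name"].str.extract(r"^(.*) STATION.*$", expand=False)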
| | stop_name |
|---|---|
| 0 | BOONTON |
| 1 | BUTLER |
| 2 | CONVENT |
| 3 | DOVER |
| 4 | LAKE HOPATCONG |
| 5 | MADISON |
| 6 | MILLINGTON |
| 7 | MORRIS PLAINS |
| 8 | MORRISTOWN |
| 9 | MOUNT ARLINGTON |
| 10 | MOUNT TABOR |
| 12 | POMPTON PLAINS |
| 13 | TOWACO |
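One difference between the two regex approaches may matter before a merge: when a row does not contain " STATION", str.extract yields NaN, while str.replace leaves the value unchanged. A small sketch, assuming a made-up row ("HOBOKEN TERMINAL" is not in the question's data):

```python
import pandas as pd

# The second value is a hypothetical row with no " STATION" in it.
s = pd.Series(["BOONTON STATION", "HOBOKEN TERMINAL"])

# extract: rows that do not match the pattern become NaN.
extracted = s.str.extract(r"^(.*) STATION.*$", expand=False)

# replace: rows that do not match the pattern are left as they were.
replaced = s.str.replace(r" STATION.*", "", regex=True)

print(extracted.isna().tolist())  # [False, True]
print(replaced.tolist())          # ['BOONTON', 'HOBOKEN TERMINAL']
```

If the real column might contain names without "STATION", replace is probably the safer choice, since NaN keys would not match anything meaningful in the merge.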
Alternative without regular expressions:
>>> df["stop_name"].str.split("STATION").str[0].str.strip()
0 BOONTON
1 BUTLER
2 CONVENT
3 DOVER
4 LAKE HOPATCONG
5 MADISON
6 MILLINGTON
7 MORRIS PLAINS
8 MORRISTOWN
9 MOUNT ARLINGTON
10 MOUNT TABOR
12 POMPTON PLAINS
13 TOWACO
Name: stop_name, dtype: object
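Splitting on the bare word "STATION" relies on strip() to drop the trailing space, and it would also split inside any longer word that happens to contain those letters. A slightly stricter sketch, assuming the delimiter should be " STATION" with its leading space and that only the first occurrence matters:

```python
import pandas as pd

s = pd.Series(["MORRIS PLAINS STATION", "BUTLER STATION (NEW JERSEY)"])

# n=1 splits at the first occurrence only; .str[0] keeps the part before it.
# Including the leading space in the delimiter makes strip() unnecessary.
cleaned = s.str.split(" STATION", n=1).str[0]
print(cleaned.tolist())  # ['MORRIS PLAINS', 'BUTLER']
```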