Python - Splitting and replacing parts of a list with multiple replacements in a pandas dataframe

In Python, I have a list of places in a pandas dataframe, and I want to reduce each string to match the format of a larger list, with the goal of merging the two lists.

Ultimately, I want to make this list match the format of the other dataframe so that when I merge, I'm only merging rows where the "stop_name" column matches.

For example, in the list below, I want to remove " STATION", so that "BOONTON STATION" becomes just "BOONTON".

However, I also want "BUTLER STATION (NEW JERSEY)" to become just "BUTLER", removing " STATION (NEW JERSEY)".

Lastly, for a two-word station name I want to keep the second word, so that "MORRIS PLAINS STATION" becomes just "MORRIS PLAINS".

Basically, on every row of the "stop_name" column, I want to remove everything from the space before the word "STATION" through the end of the string.

I've tried various splits and string replacements, but I either get errors or the replacement isn't applied to every row.

Any direction toward a viable solution would be appreciated.

stop_name
0   BOONTON STATION
1   BUTLER STATION (NEW JERSEY)
2   CONVENT STATION (NJ TRANSIT)
3   DOVER STATION (NJ TRANSIT)
4   LAKE HOPATCONG STATION
5   MADISON STATION (NJ TRANSIT)
6   MILLINGTON STATION
7   MORRIS PLAINS STATION
8   MORRISTOWN STATION
9   MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION

It seems you just want to replace the pattern " STATION.*" with an empty string:

df.stop_name.str.replace(r' STATION.*', '', regex=True)

0             BOONTON
1              BUTLER
2             CONVENT
3               DOVER
4      LAKE HOPATCONG
5             MADISON
6          MILLINGTON
7       MORRIS PLAINS
8          MORRISTOWN
9     MOUNT ARLINGTON
10        MOUNT TABOR
12     POMPTON PLAINS
13             TOWACO
Name: stop_name, dtype: object
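As a self-contained check of the replacement above, here is a minimal sketch on three rows from the question. The slightly stricter pattern (`\s+`, `\b`, `$`) is my own variation, not part of the answer, and `regex=True` is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Three sample rows copied from the question's data.
df = pd.DataFrame({"stop_name": [
    "BOONTON STATION",
    "BUTLER STATION (NEW JERSEY)",
    "MORRIS PLAINS STATION",
]})

# Drop the whitespace before the word STATION and everything after it.
# \b keeps the pattern from matching a longer word that starts with STATION.
cleaned = df["stop_name"].str.replace(r"\s+STATION\b.*$", "", regex=True)

print(cleaned.tolist())  # ['BOONTON', 'BUTLER', 'MORRIS PLAINS']
```

Assign the result back with df["stop_name"] = cleaned if the column should be overwritten; str.replace returns a new Series and does not modify the dataframe in place.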

A regular expression with str.extract() is straightforward.

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""stop_name
0   BOONTON STATION
1   BUTLER STATION (NEW JERSEY)
2   CONVENT STATION (NJ TRANSIT)
3   DOVER STATION (NJ TRANSIT)
4   LAKE HOPATCONG STATION
5   MADISON STATION (NJ TRANSIT)
6   MILLINGTON STATION
7   MORRIS PLAINS STATION
8   MORRISTOWN STATION
9   MOUNT ARLINGTON STATION
10  MOUNT TABOR STATION
12  POMPTON PLAINS STATION
13  TOWACO STATION"""), sep=r"\s\s+", engine="python")

# expand=False returns a Series rather than a one-column DataFrame
df["stop_name"] = df["stop_name"].str.extract(r"^(.*) STATION.*$", expand=False)


          stop_name
0           BOONTON
1            BUTLER
2           CONVENT
3             DOVER
4    LAKE HOPATCONG
5           MADISON
6        MILLINGTON
7     MORRIS PLAINS
8        MORRISTOWN
9   MOUNT ARLINGTON
10      MOUNT TABOR
12   POMPTON PLAINS
13           TOWACO
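One difference between extract and replace is worth knowing before a merge: str.extract returns NaN for any row the pattern does not match, while str.replace leaves such rows unchanged. The sketch below illustrates this; the row "NEWARK PENN" is a hypothetical value that is not in the question's data, added only to show a non-matching case.

```python
import pandas as pd

# Second value is hypothetical: a name with no " STATION" in it.
names = pd.Series(["DOVER STATION (NJ TRANSIT)", "NEWARK PENN"])

# extract: non-matching rows become NaN.
extracted = names.str.extract(r"^(.*) STATION.*$", expand=False)

# replace: non-matching rows pass through untouched.
replaced = names.str.replace(r" STATION.*$", "", regex=True)

# If extract is preferred, fall back to the original value where nothing matched.
safe = extracted.fillna(names)

print(extracted.isna().tolist())  # [False, True]
print(replaced.tolist())          # ['DOVER', 'NEWARK PENN']
print(safe.tolist())              # ['DOVER', 'NEWARK PENN']
```

With the question's data every row contains " STATION", so both approaches give the same result there.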

Alternative without regular expression:

>>> df["stop_name"].str.split("STATION").str[0].str.strip()
0             BOONTON
1              BUTLER
2             CONVENT
3               DOVER
4      LAKE HOPATCONG
5             MADISON
6          MILLINGTON
7       MORRIS PLAINS
8          MORRISTOWN
9     MOUNT ARLINGTON
10        MOUNT TABOR
12     POMPTON PLAINS
13             TOWACO
Name: stop_name, dtype: object
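Since the stated goal is a merge on "stop_name", the following sketch shows how the cleaned column might be used for that step. The second dataframe `other`, its "line" column, and all of its values are made up for illustration; the question does not show the larger list.

```python
import pandas as pd

# Rows copied from the question's data.
stops = pd.DataFrame({"stop_name": [
    "BOONTON STATION",
    "DOVER STATION (NJ TRANSIT)",
    "TOWACO STATION",
]})

# Hypothetical larger table; the "line" column and its values are invented.
other = pd.DataFrame({
    "stop_name": ["BOONTON", "DOVER", "DENVILLE"],
    "line": ["A", "B", "A"],
})

# Clean the names using the split-based approach from the answer above.
stops["stop_name"] = stops["stop_name"].str.split("STATION").str[0].str.strip()

# An inner merge keeps only rows whose stop_name appears in both tables.
merged = stops.merge(other, on="stop_name", how="inner")

print(merged["stop_name"].tolist())  # ['BOONTON', 'DOVER']
```

Using how="left" with indicator=True instead would keep every cleaned stop and flag the ones that found no partner, which can help spot names that still differ in format.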
