Using regex to remove unwanted string endings

Question

I'm struggling a little with a regex to remove extraneous trailing characters. I've tried a few ideas that I found here, but none are quite what I'm looking for.

The data looks like this (a single column):

City1[edit]

City2 (University Name)

City with a Space (University Name)

Etc.

Basically, the trouble is that I can't simply remove everything after a space, because some city names include a space ("New York City").

However, I think I could use a three-step approach:

1. Replace anything between [], (), or {} pairs (this will remove the "edit" and the "University Name" in the sample data).
2. Remove the [], (), {} characters themselves, since those are now extra characters.
3. Trim any trailing spaces (which will leave the spaces inside city names such as St. Paul).
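
As a rough sketch of the three steps above, written with Python's built-in re module rather than pandas: the function name three_step_clean and the exact patterns are assumptions for illustration, not code from the attempts in this question.

```python
import re

def three_step_clean(raw):
    """Hypothetical helper: apply the three steps one at a time."""
    # Step 1: empty out each [...], (...), or {...} group, keeping the delimiters.
    emptied = re.sub(r"([\[({])[^\])}]*([\])}])", r"\1\2", raw)
    # Step 2: remove the now-empty bracket characters.
    no_brackets = re.sub(r"[\[\](){}]", "", emptied)
    # Step 3: trim trailing whitespace; spaces inside the name are untouched.
    return no_brackets.rstrip()

samples = [
    "City1[edit]",
    "City2 (University Name)",
    "City with a Space (University Name)",
]
print([three_step_clean(s) for s in samples])
```

For the three sample values this prints ['City1', 'City2', 'City with a Space'].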

I have two main questions:

1. Is there a way to do this in one command, or will it have to be three separate commands?
2. How do you remove the characters between specific characters using regex?

Code that I have attempted:

DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace=True)
However, this only replaced the last of the special characters.

DF[0].replace(r'[\\W+$|^0-9a-zA-Z*]', "", regex=True, inplace=True)
Unfortunately, this just replaced everything, leaving all my data blank.
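
To see what those two patterns actually match, they can be tried on a single value with re.sub, which should behave the same way as replace(..., regex=True) for this purpose. This is only a diagnostic sketch; the sample string comes from the data above.

```python
import re

sample = "City1[edit]"

# Attempt 1: the class matches exactly one character that is not a letter,
# digit, or *, and $ anchors it to the end, so only the final "]" is removed.
first = re.sub(r"[^0-9a-zA-Z*]$", "", sample)

# Attempt 2: everything sits inside one [...] class, so +, $, | and ^ are
# literal characters there. The class matches any letter or digit (plus a few
# symbols), which strips the text and leaves only the brackets.
second = re.sub(r"[\\W+$|^0-9a-zA-Z*]", "", sample)

print(first)   # City1[edit
print(second)  # []
```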

Answer 1

If you always know which bracket characters can come first, you can do the following.

Create the data:

import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

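Then replace everything from the first bracket onward and strip the leftover whitespace.

# regex=True is spelled out because newer pandas versions treat the pattern as a literal string by default
df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()

Output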

0                City1
1                City2
2    City with a Space

Answer 2

A regexp would be a relatively easy way to do this.

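import re

# An opening bracket, a letter or space, one or more further characters, then a closing bracket
p = re.compile(r'(\(|\[|\{)[A-Za-z ].+(\)|\]|\})')
dirty = 'City with a Space (University Name)'
cleaned = p.sub('', dirty).strip()
print(cleaned)  # City with a Space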
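
Because `.+` is greedy, the pattern above would run from the first opening bracket to the last closing one if a value held two bracketed groups. A possible variant, sketched here under that assumption, gives each bracket type its own alternative that stops at the matching closer. The names BRACKETED and strip_bracketed are made up for this example.

```python
import re

# One alternative per bracket type; each stops at its own closing character.
# The leading \s* also swallows whitespace in front of the group.
BRACKETED = re.compile(r"\s*(?:\[[^\]]*\]|\([^)]*\)|\{[^}]*\})")

def strip_bracketed(value):
    """Hypothetical helper: drop every bracketed group, then trim the ends."""
    return BRACKETED.sub("", value).strip()

values = [
    "City1[edit]",
    "City2 (University Name)",
    "City with a Space {University Name}",
    "City (State) Extra [edit]",
]
print([strip_bracketed(v) for v in values])
```

The last value is an invented case with two groups; it comes out as "City Extra", so text between the groups is kept.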

Answer 3

Option with split:
Look for zero or more whitespace characters followed by a [, (, or {.
Split at that point and take the first part.

df.names.str.split(r'\s*[\[\{\(]').str[0]

0                City1
1                City2
2    City with a Space
Name: names, dtype: object
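
The same idea also works on a plain string with re.split, which may be handy outside pandas. A minimal sketch, with before_first_bracket as a hypothetical helper name; maxsplit=1 stops after the first bracket, since only the leading part is kept.

```python
import re

def before_first_bracket(value):
    # Split once at optional whitespace followed by [, {, or (, and keep the left part.
    return re.split(r"\s*[\[{(]", value, maxsplit=1)[0]

for raw in ["City1[edit]", "City2 (University Name)", "New York City"]:
    print(before_first_bracket(raw))
```

A value without any bracket, such as "New York City", is returned unchanged.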

Answer 1 (score 3, accepted): 2016-12-20 21:05:09

Answer 2 (score 0): 2016-12-20 21:09:25

Answer 3 (score 0): 2016-12-20 21:16:52