Using regex to remove unwanted end of a string

I'm struggling a little with a regex to remove trailing extraneous characters. I've tried a few ideas that I found here, but none are quite what I'm looking for.

The data looks like this (a single column):

City1[edit]

City2 (University Name)

City with a Space (University Name)

Etc.

Basically, the trouble is that I can't simply remove everything after a space, because some city names contain a space ("New York City").

However, I think I could use a three-step approach:

  1. Remove anything between [], (), or {} pairs (this removes the "edit" and the "University Name" in the sample data).
  2. Remove the [], (), {} characters themselves, since they are now extraneous.
  3. Trim any trailing spaces (which leaves the spaces inside city names such as St. Paul intact).
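
The three steps above can be sketched with Python's standard `re` module. This is only an illustration under stated assumptions: the helper name `clean_city` is hypothetical, bracket pairs are assumed not to be nested, and the same patterns could presumably be handed to pandas with `regex=True`.

```python
import re

def clean_city(value):
    """Hypothetical helper that follows the three steps literally."""
    # Step 1: empty out each bracket pair, e.g. "[edit]" becomes "[]".
    value = re.sub(r"\[[^\]]*\]", "[]", value)
    value = re.sub(r"\([^)]*\)", "()", value)
    value = re.sub(r"\{[^}]*\}", "{}", value)
    # Step 2: delete the bracket characters that are left over.
    value = re.sub(r"[\[\](){}]", "", value)
    # Step 3: trim trailing whitespace; spaces inside the name are untouched.
    return value.rstrip()

for raw in ["City1[edit]", "City2 (University Name)", "St. Paul"]:
    print(repr(clean_city(raw)))
```

Steps 1 and 2 are kept separate here only to mirror the list; removing the brackets together with their contents in one substitution would make step 2 unnecessary.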

I have two main questions:

  1. Is there a way to do this in one command, or will it have to be three separate commands?
  2. How do you remove the characters between specific characters using regex?

Code that I have attempted:

  1. DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace=True) -- however, this only replaced the final occurrence of the special characters

  2. DF[0].replace(r'[\\W+$|^0-9a-zA-Z*]', "", regex=True, inplace=True) -- unfortunately this replaced everything, leaving all my data blank
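
To see why the two attempts behave as described, it can help to run the same patterns through `re.sub` on the sample values. A minimal sketch, assuming the column holds plain strings like those above; the variable names are illustrative only.

```python
import re

samples = ["City1[edit]", "City2 (University Name)"]

# Attempt 1: the class matches ONE character that is not a letter, digit or "*",
# and "$" pins it to the end of the string, so only the final bracket goes.
attempt1 = [re.sub(r"[^0-9a-zA-Z*]$", "", s) for s in samples]
print(attempt1)  # ['City1[edit', 'City2 (University Name']

# Attempt 2: inside [...] the characters W, +, $, | and * are literals, and "^"
# only negates when it comes first. The class therefore matches every letter
# and digit, which strips the text and leaves brackets and spaces behind.
attempt2 = [re.sub(r"[\\W+$|^0-9a-zA-Z*]", "", s) for s in samples]
print(attempt2)  # ['[]', ' ( )']
```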

If you always know which bracket character will come first, you can do:

Create the data:

import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

Then remove everything from the first bracket onward.

df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()

Output

0                City1
1                City2
2    City with a Space

A regexp would be a relatively easy way to do this.

import re

# Raw string, so the backslashes reach the regex engine unchanged.
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')
dirty = 'City with a Space (University Name)'
cleaned = p.sub('', dirty).strip()
print(cleaned)
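
One thing worth checking with the pattern above is that it expects the bracketed text to start with a letter or space and to be at least two characters long, so a numeric footnote marker such as `[2]` would be left in place. A possible variant, offered as a sketch rather than as the answerer's method, uses negated character classes. The names `strict` and `loose` are illustrative, and bracket pairs are assumed to be unnested and correctly matched.

```python
import re

# The pattern from the answer, unchanged apart from the raw-string prefix.
strict = re.compile(r"(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})")

# Variant: optional whitespace, an opening bracket, any run of characters
# that are not closing brackets, then a closing bracket.
loose = re.compile(r"\s*[\[({][^\])}]*[\])}]")

for dirty in ["City1[2]", "City2 (University Name)"]:
    print(repr(strict.sub("", dirty).strip()), repr(loose.sub("", dirty).strip()))
```

The variant does not check that the closing bracket is the same type as the opening one, which seems acceptable for data like the sample but may not suit every input.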

Option with split:
look for zero or more whitespace characters followed by a [, (, or {,
then split at that point and take the first part.

df.names.str.split(r'\s*[\[\{\(]').str[0]

0                City1
1                City2
2    City with a Space
Name: names, dtype: object
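
The same split idea can be tried outside pandas with `re.split`. A small sketch, assuming plain Python strings; `maxsplit=1` is an addition here that stops after the first bracket, and `strip_suffix` is a hypothetical helper name.

```python
import re

def strip_suffix(name):
    """Keep only the text before the first opening bracket (hypothetical helper)."""
    # Split once at optional whitespace followed by [, { or (, then keep the
    # left-hand part. A name with no bracket comes back unchanged.
    return re.split(r"\s*[\[\{\(]", name, maxsplit=1)[0]

for raw in ["City1[edit]", "City with a Space {University Name}", "New York City"]:
    print(repr(strip_suffix(raw)))
```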
