
Using regex to remove unwanted end of a string

I'm struggling a little with some regex execution to remove trailing extraneous characters. I've tried a few ideas that I found here, but none are quite what I'm looking for.

Data looks like this (only one column of data):

City1[edit]

City2 (University Name)

City with a Space (University Name)

Etc.

Basically, the trouble that I run into here is I can't necessarily remove everything after a space because sometimes a city name includes a space ("New York City").

However, what I think I could do is a three step approach:

  1. Replace anything between [], (), or {} pairs (this will remove the "edit" and the "University Name" in the sample data).
  2. Replace the [], (), {} characters themselves, since those are now extra characters.
  3. Trim any trailing spaces (which will leave the spaces in city names such as St. Paul).

I have two main questions: 1. Is there a way to do this in one command, or will it have to be three separate commands? 2. How do you remove characters in between specific characters using regex?

Code that I have attempted:

  1. DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace = True) ---however this only replaced the final iteration of the special characters

  2. DF[0].replace(r'[\\W+$|^0-9a-zA-Z*]',"",regex=True, inplace=True) --unfortunately this just replaced everything, leaving all my data blank

If you know in advance which opening bracket characters can appear, you can do this:

Create the data:

import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

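Then replace everything from the first opening bracket onward and strip the trailing whitespace. Pass regex=True explicitly, because Series.str.replace treats the pattern as a literal string by default in pandas 2.0 and later.

df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()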

Output

0                City1
1                City2
2    City with a Space
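The three alternatives above can also be collapsed into a single character class. The following is a minimal sketch rather than part of the original answer: it assumes the same sample column and that nothing worth keeping ever follows the first opening bracket. The leading `\s*` absorbs the space before the bracket, so no separate strip step is needed.

```python
import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

# \s*     optional whitespace just before the bracket
# [\[({]  any one opening bracket
# .*$     everything up to the end of the string
cleaned = df['names'].str.replace(r'\s*[\[({].*$', '', regex=True)
print(cleaned.tolist())
```

To update the column itself, assign the result back, for example `df['names'] = cleaned`.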

A regexp would be a relatively easy way to do this.

import re

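# Raw string so the backslashes reach the regex engine unchanged.
# Matches an opening bracket, the text inside it, and a closing bracket.
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')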
dirty = 'City with a Space (University Name)'
cleaned = p.sub('', dirty).strip()
print(cleaned)
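Note that the `.+` in the pattern above is greedy, so if a string holds more than one bracketed group, it removes everything from the first opening bracket to the last closing bracket. The sketch below is one possible variant, not the original answerer's code: it removes each bracketed group separately and keeps any text that follows it. The helper name `strip_bracketed` and the third sample string are made up for illustration.

```python
import re

# Optional leading whitespace, then one bracketed group whose contents
# may not include the matching closing bracket.
BRACKETED = re.compile(r'\s*(?:\[[^\]]*\]|\([^)]*\)|\{[^}]*\})')

def strip_bracketed(text):
    """Remove every [...], (...), or {...} group, then trim the ends."""
    return BRACKETED.sub('', text).strip()

samples = ['City1[edit]', 'City2 (University Name)', 'St. Paul {Campus} East']
cleaned = [strip_bracketed(s) for s in samples]
print(cleaned)
```

This does all three steps from the question in a single substitution. Nested brackets are not handled.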

Another option uses split: look for zero or more whitespace characters followed by a [, (, or {, split at that point, and take the first part.

df.names.str.split(r'\s*[\[\{\(]').str[0]

0                City1
1                City2
2    City with a Space
Name: names, dtype: object
