Using regex to remove unwanted end of a string
I'm struggling with a regex to remove trailing extraneous characters. I've tried a few ideas that I found here, but none do quite what I'm looking for.
Data looks like this (only one column of data):
City1[edit]
City2 (University Name)
City with a Space (University Name)
Etc.
Basically, the trouble I run into here is that I can't simply remove everything after a space, because sometimes a city name includes a space ("New York City").
However, what I think I could do is a three-step approach.
I have two main questions:
1. Is there a way to do this in one command, or will it have to be three separate commands?
2. How do you remove characters in between specific characters using regex?
Code that I have attempted:
DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace = True)
However, this only removed the final trailing special character.
DF[0].replace(r'[\\W+$|^0-9a-zA-Z*]',"",regex=True, inplace=True)
Unfortunately, this replaced everything, leaving all my data blank.
If you always know which opening bracket characters can appear, you can do the following.
Create data:
import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})
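Then remove everything from the first bracket onward. Passing regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default.
df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()
Output: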
0 City1
1 City2
2 City with a Space
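The three alternatives above can also be folded into one character class, which also covers the question of doing this in a single command. This is only a minimal sketch, assuming a pandas version where str.replace accepts regex=True; the names pattern and cleaned are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

# Optional whitespace, then the first opening bracket, then everything up to the end.
# Folding the leading whitespace into the pattern makes a separate .str.strip() unnecessary here.
pattern = r'\s*[\[\(\{].*$'

cleaned = df['names'].str.replace(pattern, '', regex=True)
print(cleaned.tolist())
```

Values with no bracket at all are left unchanged, because the pattern simply does not match them.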
A regexp would be a relatively easy way to do this.
import re
# Raw string so the backslashes reach the regex engine unchanged
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')
dirty = 'City with a Space (University Name)'
cleaned = p.sub('', dirty).strip()
print(cleaned)
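To check that pattern against every sample value from the question rather than just one, something like the following could be used. It is a sketch that uses only the standard library; the list samples and the variable cleaned_all are made up for illustration.

```python
import re

# Same pattern as above: an opening bracket, a letter or space,
# at least one more character, then a closing bracket.
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')

samples = ['City1[edit]',
           'City2 (University Name)',
           'City with a Space (University Name)']

# Remove the bracketed part, then trim the space left in front of it.
cleaned_all = [p.sub('', s).strip() for s in samples]
print(cleaned_all)
```

Note that this pattern needs a closing bracket to match, so a value whose bracket is never closed would be left as it is.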
Option with split: look for zero or more whitespace characters followed by a [, (, or {, then split at that point and take the first part.
df.names.str.split(r'\s*[\[\{\(]').str[0]
0 City1
1 City2
2 City with a Space
Name: names, dtype: object
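If the values live in a plain Python list rather than a pandas Series, the same split idea can be sketched with re.split from the standard library. The names splitter and first_parts are illustrative only.

```python
import re

names = ['City1[edit]',
         'City2 (University Name)',
         'City with a Space {University Name}']

# Zero or more whitespace characters followed by an opening bracket.
splitter = re.compile(r'\s*[\[\{\(]')

# Split once at the first bracket and keep the part before it.
first_parts = [splitter.split(name, maxsplit=1)[0] for name in names]
print(first_parts)
```

Using maxsplit=1 stops after the first bracket, so any later brackets in the discarded part do not cause extra work.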