python - remove whitespace between two characters using re.sub
I have a pair of columns, like so:
x = ["a b williams", "e g", "z z specialists"]
y = ["j j winston", "hb d party supplies", "t t ice cream"]
df = pd.DataFrame(x,y)
I would like to be able to remove the whitespace between two single characters using re.sub. I have tried the following:
re.sub("(?<=\\w\\b)"\\s"(?=\\w\\b)", "", df)
However, when I run the code, I get the following error:
SyntaxError: unexpected character after line continuation character
I'm unsure of what I am doing wrong. The desired result is:
jj winston ab williams
hb d party supplies eg
tt ice cream zz specialists
Please advise. Any advice is appreciated.
You can use
(?<=\b[^\W\d_])\s(?=[^\W\d_]\b)
or, if digits and underscores should also count as single characters,
(?<=\b\w)\s(?=\w\b)
See the regex demo. Note that the [^\W\d_] pattern matches any Unicode letter in Python re, while \w matches Unicode letters, digits, _ and some diacritics and other connector punctuation.
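To make the difference between the two character classes concrete, here is a small sketch using plain re.sub. The sample string and the names letters_only and word_chars are illustrative assumptions, not part of the original answer.

```python
import re

# Letters only: digits and underscores never count as "single characters".
letters_only = re.compile(r'(?<=\b[^\W\d_])\s(?=[^\W\d_]\b)')
# Any word character: letters, digits and the underscore all qualify.
word_chars = re.compile(r'(?<=\b\w)\s(?=\w\b)')

sample = "a 1 b c"
letters_result = letters_only.sub('', sample)  # only "b c" is joined
word_result = word_chars.sub('', sample)       # every single character is joined

print(letters_result)  # a 1 bc
print(word_result)     # a1bc
```

Which variant fits depends on whether a lone digit such as the 1 above should be glued to its neighbours.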
Details
(?<=\b[^\W\d_]) - a positive lookbehind that matches a location immediately preceded by a single letter standing as a whole word (the letter is itself preceded by a word boundary)
\s - a whitespace char
(?=[^\W\d_]\b) - a positive lookahead that matches a location immediately followed by a single letter standing as a whole word (the letter is followed by a word boundary).
Here is a Pandas demo:
import pandas as pd

x = ["a b williams", "e g", "z z specialists"]
y = ["j j winston", "h d party supplies", "t t ice cream"]
df = pd.DataFrame(x,y)
rx = r'(?<=\b[^\W\d_])\s(?=[^\W\d_]\b)'
df.index = df.index.to_series().replace(rx, '', regex=True)
df = df.replace(rx, '', regex=True)
# => df
# 0
# jj winston ab williams
# hd party supplies eg
# tt ice cream zz specialists
As DataFrame.replace with regex=True does not touch the index, the index must be handled separately, hence the added line df.index = df.index.to_series().replace(rx, '', regex=True).
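The lookaround pattern described in the details can also be tried on plain strings, without pandas. The following is a minimal sketch; the list name samples and the extra string "a b c d" are assumptions added for illustration.

```python
import re

rx = re.compile(r'(?<=\b[^\W\d_])\s(?=[^\W\d_]\b)')

samples = ["j j winston", "hb d party supplies", "a b c d"]
cleaned = [rx.sub('', s) for s in samples]

# "hb d party supplies" is left alone: "hb" is not a single letter,
# and "d" is followed by a multi-letter word.
# "a b c d" collapses completely because lookarounds do not consume
# the letters, so each letter can serve two neighbouring matches.
print(cleaned)  # ['jj winston', 'hb d party supplies', 'abcd']
```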
Your regex is pretty close to what is required and can be slightly modified as follows:
r'(?<=\b\w)(\s)(?=\w\b)'
Note the use of a raw string r'...' so that you don't need to double each \ in the regex.
It is better to compile the regex to speed up the processing, as it is used more than once:
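As a sketch of why the raw string matters here: in a normal string literal, \b is the backspace character rather than a regex word boundary. The variable names below are illustrative assumptions.

```python
import re

raw_boundary = r'\b'    # two characters: a backslash and 'b' (regex word boundary)
plain_boundary = '\b'   # one character: ASCII backspace (0x08)

text = "a b williams"

# Raw string: backslashes reach the regex engine unchanged.
with_raw = re.sub(r'(?<=\b\w)\s(?=\w\b)', '', text)
# Normal string with every backslash doubled: the same pattern.
with_doubled = re.sub('(?<=\\b\\w)\\s(?=\\w\\b)', '', text)
# Normal string with a single \b: the pattern now looks for backspaces,
# so nothing in the text matches.
with_backspace = re.sub('(?<=\b\\w)\\s(?=\\w\b)', '', text)

print(with_raw, with_doubled, with_backspace, sep=' | ')
```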
import re
import pandas as pd

pattern = re.compile(r'(?<=\b\w)(\s)(?=\w\b)')
Then reuse your code:
x = ["a b williams", "e g", "z z specialists"]
y = ["j j winston", "h d party supplies", "t t ice cream"]
df = pd.DataFrame(x,y)
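Convert the index:
# regex=True is needed with a compiled pattern, since recent pandas versions default to regex=False
df.index = df.index.to_series().str.replace(pattern, '', regex=True)
Convert the data column:
df[0] = df[0].str.replace(pattern, '', regex=True)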
Explanation of your errors:
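A likely cause of the SyntaxError, offered as an assumption rather than in the answerer's own words: the inner double quotes around \\s end the string literal early, so the backslashes sit outside any string, where Python reads a backslash as a line continuation character. Separately, re.sub expects a single string, not a DataFrame. The sketch below keeps the failing call as text (bad_source is a hypothetical name) so that it can be compiled safely.

```python
import re

# The attempted call, stored as source text with a plain string as input.
bad_source = r'''re.sub("(?<=\\w\\b)"\\s"(?=\\w\\b)", "", "a b williams")'''

try:
    compile(bad_source, "<example>", "eval")
    raised = False
except SyntaxError as exc:
    raised = True
    print("SyntaxError:", exc.msg)

# One string literal, raw, with the word boundary placed before the
# single character in the lookbehind.
fixed = re.sub(r'(?<=\b\w)\s(?=\w\b)', '', "a b williams")
print(fixed)  # ab williams
```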
Using re.sub, I suggest the following:
import re
import pandas as pd

# your lists
x = ["a b williams", "e g", "z z specialists"]
y = ["j j winston", "hb d party supplies", "t t ice cream"]
# replacements
x = [re.sub(r'(\b\w)(\s)(\w\b)', r'\1\3', el) for el in x]
y = [re.sub(r'(\b\w)(\s)(\w\b)', r'\1\3', el) for el in y]
# pd dataframe after the process
df = pd.DataFrame(x,y)
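One difference from the lookaround patterns is worth noting, sketched here with an assumed sample string: the capture-group version consumes both letters, so in a run of three or more single characters only every other space is removed in one pass.

```python
import re

text = "a b c williams"

# Capture groups consume "a b", so "b" cannot start the next match.
with_groups = re.sub(r'(\b\w)(\s)(\w\b)', r'\1\3', text)
# Lookarounds consume only the space, so every qualifying space goes.
with_lookarounds = re.sub(r'(?<=\b\w)\s(?=\w\b)', '', text)

print(with_groups)       # ab c williams
print(with_lookarounds)  # abc williams
```

For the two-letter cases in the question both approaches give the same result.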