简体   繁体   English

如何仅删除出现在数字之后和带有 RegEx 的字母之前的空格?

[英]How to remove only the spaces that occur after a digit, and before a letter with RegEx?

I have a pandas column, shown below, having days and hours in different languages.我有一个 pandas 列,如下所示,有不同语言的日期和时间。 I would like to standardize this column to cumulative hours.我想将此列标准化为累计小时数。 Unfortunately there are spaces, and before I can do any further processing, I would like to remove only the spaces that occur after a digit, and before a letter.不幸的是有空格,在我可以做任何进一步处理之前,我想只删除出现在数字之后和字母之前的空格。 Hence keeping the space between the day and hour or between the hour and minute indicators.因此,在日期和小时之间或小时和分钟指示器之间保持空间。

If this problem is solved, then the next approach I would do is to split the column by the remaining space, and apply a custom mapper to process the rest of the data.如果这个问题解决了,那么我要做的下一个方法是用剩余空间拆分列,并应用自定义映射器来处理数据的 rest。

How can I remove only the spaces that occur after a digit, and before a letter?如何仅删除数字之后和字母之前出现的空格?

import pandas as pd
L=['4 d 19 h',
 '6g 23h',
 '6g 23h',
 '6g 23h',
 '6g 23h',
 '5g 18h',
 '5g 11h',
 '4T 20Std',
 '5T 18Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 14Std',
 '1T 14Std',
 '4T 9Std',
 '1T 22Std',
 '20Std 50 Min',
 '9\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '9\xa0j 22\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 18\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '9\xa0j 22\xa0h',
 '9\xa0j 22\xa0h',
 '2\xa0h 8\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 9\xa0min',
 '5\xa0j 23\xa0h',
 '4\xa0j 20\xa0h',
 '4\xa0j 22\xa0h',
 '6\xa0j 20\xa0h',
 '2\xa0h 5\xa0min',
 '4\xa0j 19\xa0h',
 '5\xa0j 23\xa0h',
 '6\xa0j 11\xa0h',
 '3\xa0j 9\xa0h',
 '5\xa0j 16\xa0h',
 '5\xa0j 18\xa0h',
 '21\xa0h 8\xa0min',
 '1\xa0j 20\xa0h',
 '4\xa0j 21\xa0h',
 '3\xa0j 18\xa0h',
 '2\xa0j 17\xa0h',
 '9\xa0j 15\xa0h',
 '5\xa0j 20\xa0h',
 '2\xa0j 19\xa0h',
 '20\xa0h 31\xa0min',
 '5\xa0j 20\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 15\xa0h',
 '5\xa0j 23\xa0h',
 '4\xa0j 17\xa0h',
 '21\xa0h 8\xa0min',
 '4\xa0j 18\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 22\xa0h',
 '4\xa0j 19\xa0h',
 '15\xa0h 44\xa0min',
 '9\xa0h 58\xa0min',
 '6\xa0j 22\xa0h',
 '21\xa0h 8\xa0min',
 '7\xa0j 14\xa0h',
 '6\xa0j',
 '5\xa0j 20\xa0h',
 '6\xa0j 17\xa0h',
 '6\xa0j 11\xa0h',
 '9\xa0j 23\xa0h',
 '2\xa0j 17\xa0h',
 '6\xa0j 11\xa0h',
 '5\xa0j 16\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '2\xa0j 8\xa0h',
 '6\xa0j 16\xa0h',
 '4\xa0j 12\xa0h',
 '6\xa0j 14\xa0h',
 '2\xa0j 20\xa0h',
 '3\xa0j 17\xa0h',
 '5\xa0j 20\xa0h',
 '7\xa0j 3\xa0h',
 '5\xa0j 18\xa0h',
 '1\xa0j 20\xa0h',
 '6\xa0j 4\xa0h',
 '1\xa0j 1\xa0h',
 '6\xa0j 21\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 14\xa0h',
 '5\xa0j 2\xa0h',
 '5\xa0j 16\xa0h',
 '9\xa0j 15\xa0h',
 '8\xa0j 23\xa0h',
 '1\xa0j 21\xa0h',
 '6\xa0j 4\xa0h',
 '1\xa0j 16\xa0h',
 '6\xa0j 18\xa0h',
 '18\xa0h 6\xa0min',
 '17\xa0h 38\xa0min',
 '5\xa0j 10\xa0h',
 '9\xa0h 52\xa0min',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 1\xa0h',
 '5\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '1\xa0j 19\xa0h',
 '4\xa0j 22\xa0h',
 '3\xa0j',
 '1\xa0j 17\xa0h',
 '1\xa0j 10\xa0h',
 '4\xa0j 19\xa0h',
 '6\xa0j 7\xa0h',
 '8\xa0j 10\xa0h',
 '2\xa0j 19\xa0h',
 '9\xa0j 18\xa0h',
 '18\xa0h 23\xa0min',
 '5\xa0j 20\xa0h',
 '2\xa0j 16\xa0h',
 '18\xa0h 46\xa0min',
 '1\xa0j 19\xa0h',
 '3\xa0j 2\xa0h',
 '9\xa0j 20\xa0h',
 '9\xa0j 20\xa0h',
 '9\xa0j 20\xa0h',
 '6\xa0j 21\xa0h',
 '3\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '2\xa0j 11\xa0h',
 '6\xa0j 1\xa0h',
 '6\xa0j 1\xa0h',
 '5\xa0j 19\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 20\xa0h',
 '4\xa0j 18\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 15\xa0h',
 '5\xa0j 20\xa0h',
 '9\xa0j 18\xa0h',
 '7\xa0j 17\xa0h',
 '6\xa0j 20\xa0h',
 '5\xa0j 10\xa0h',
 '14\xa0h 43\xa0min',
 '6\xa0j 5\xa0h',
 '5\xa0j 19\xa0h',
 '9\xa0j 18\xa0h',
 '1\xa0j 18\xa0h',
 '1\xa0j 18\xa0h',
 '6\xa0j 1\xa0h',
 '6\xa0j 22\xa0h',
 '3\xa0j 10\xa0h',
 '7\xa0j 20\xa0h',
 '3\xa0j 10\xa0h',
 '6\xa0j 19\xa0h',
 '1\xa0j 14\xa0h',
 '6\xa0j 18\xa0h',
 '1\xa0j 12\xa0h',
 '5\xa0j 1\xa0h',
 '2\xa0h 7\xa0min',
 '2\xa0h 7\xa0min',
 '20\xa0h 5\xa0min',
 '5\xa0j 20\xa0h',
 '9\xa0j 7\xa0h',
 '5\xa0j 19\xa0h',
 '15\xa0h 14\xa0min',
 '2\xa0h 5\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 7\xa0min',
 '2\xa0h 5\xa0min',
 '8\xa0j 18\xa0h',
 '4\xa0j 10\xa0h',
 '5\xa0j 20\xa0h',
 '2\xa0j 19\xa0h',
 '1\xa0j 20\xa0h',
 '1\xa0j 14\xa0h',
 '6\xa0j 17\xa0h',
 '1\xa0j 22\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 12\xa0h',
 '6\xa0j 10\xa0h',
 '21\xa0h 8\xa0min',
 '5\xa0j 21\xa0h',
 '2\xa0h 10\xa0min',
 '1\xa0j 20\xa0h',
 '5\xa0j 20\xa0h',
 '3\xa0j 22\xa0h',
 '15\xa0h 12\xa0min',
 '6\xa0j 14\xa0h',
 '5\xa0j 20\xa0h',
 '1\xa0j 3\xa0h',
 '5\xa0j 21\xa0h',
 '6\xa0j 21\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 19\xa0h',
 '6\xa0j 19\xa0h',
 '7\xa0j 3\xa0h',
 '4\xa0j 18\xa0h',
 '2\xa0j 20\xa0h',
 '5\xa0j 17\xa0h',
 '8\xa0j 18\xa0h',
 '5\xa0j 16\xa0h',
 '5\xa0j 16\xa0h',
 '4\xa0j 17\xa0h',
 '2\xa0j 18\xa0h',
 '16\xa0h 39\xa0min']
DATAFRAME=pd.DataFrame({'TIME':L})

I tried many RegEx expressions, but all remove the middle space as well or expect certain number of spaces.我尝试了很多 RegEx 表达式,但都删除了中间空格或期望有一定数量的空格。

Desired outcome:期望的结果:

'4 d 19 h'       --> '4d 19h'
'6g 23h'         --> '6g 23h'
'20Std 50 Min'   --> '20Std 50Min'
'1\xa0j 18\xa0h' --> '1j 18h',
import re

re.sub(r'(?<=[0-9])[ \u00A0](?=[a-zA-Z])', '', '4 d 19 h')

Edited with an example, may add additional unicode characters from here as you wish like the above example https://jkorpela.fi/chars/spaces.html使用示例进行编辑,可以根据需要从此处添加额外的 unicode 个字符,如上例https://jkorpela.fi/chars/spaces.html

for details on the regexp specifics refer to https://www.regular-expressions.info/lookaround.html有关正则表达式细节的详细信息,请参阅https://www.regular-expressions.info/lookaround.html

If there can only be one space between the digit and the character, I would use the following regular expression: (\d+)[\s|\xa0](\w)如果数字和字符之间只能有一个空格,我会使用以下正则表达式: (\d+)[\s|\xa0](\w)

So, in your code it would be something like this:所以,在你的代码中它会是这样的:

DATAFRAME=pd.DataFrame({'TIME':[re.sub(r"(\d+)[\s|\xa0](\w+)", r"\1\2", l) for l in L]})

Hope it helps.希望能帮助到你。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM