
Delete white spaces by pattern regex

I am trying to remove whitespace from some strings, as shown below:

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

regex = r'\s*<|>\s*\<|>\s*\n+'
subst = "<|>"
# print(re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'))
for line in re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'):
    # print(line)
    if line != '':
        print (line.strip())

This is the output I get:

PERSON_ID<|>|>DEPT_ID<|>|>DATE_JOINED
AAAAA<|>|>S1<|>|>2021/01/03
BBBBBB<|>|>S2<|>|<|>    2021/02/03
CCCCC<|>|>S1<|>|>2021/03/05

I am looking for a way to remove the spaces before the date in the second-to-last line of the output.

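The problem is that you have \n+ after >\s*, so the match stops at the newline and leaves the whitespace at the start of the next line in place. Newlines are already included in \s, so drop the \n+ and let \s* consume everything around the delimiter. The | also has to be escaped: unescaped, it splits the pattern into alternatives, which is where the extra |> in your output comes from.

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

# The pipe is escaped so that <|> is matched literally instead of as alternation.
# \s* on both sides also consumes a newline and the indentation that follows it.
regex = r'\s*<\|>\s*'
subst = "<|>"
for line in re.sub(regex, subst, test_str).split('\n'):
    # skip blank lines and strip the leading indentation of each row
    if line != '':
        print(line.strip())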
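
A variation on the same idea, as a minimal sketch: build the pattern from the separator with re.escape, so the delimiter is matched literally whatever characters it contains. The helper name normalize and the short sample string are illustrative assumptions, not part of the original question.

```python
import re

SEP = "<|>"  # assumed literal delimiter

def normalize(text, sep=SEP):
    # re.escape makes every character of the separator match literally,
    # so "|" is not interpreted as the alternation operator.
    pattern = r"\s*" + re.escape(sep) + r"\s*"
    joined = re.sub(pattern, sep, text)
    # Strip the indentation of each remaining line and drop blank lines.
    return [line.strip() for line in joined.split("\n") if line.strip()]

sample = "A<|>B<|>\n    C\n    D<|>E<|>F"
for row in normalize(sample):
    print(row)
```

This only rejoins a row when the line break sits next to a delimiter; a break in the middle of a field would need different handling.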

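I thought I would try a solution that does not rely on regex and instead tries to guarantee that every row consists of three columns separated by a common delimiter.

Specifically, if you are getting this dataset from an unknown source, it is not "regular", and no expression can guarantee that the date column will always be the one that needs fixing.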

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

lines = [l.strip() for l in test_str.splitlines(keepends=False)]
sep = '<|>'

def get_columns(line, separator):
  return [x.strip() for x in line.split(separator) if x.strip()]

header = get_columns(lines[0], sep)
columns = len(header)

# walk the lines, merging each incomplete line into the incomplete line before it
i = 1
while i < len(lines):
  # print(i, lines[i])
  parts = get_columns(lines[i], sep)
  previous_parts = get_columns(lines[i-1], sep)
  # normalize the current line (drops empty fields such as a trailing separator)
  lines[i] = sep.join(parts)
  # skip lines that shouldn't be relocated
  if len(parts) == columns or len(previous_parts) == columns:
    i += 1
    continue
  if len(parts) + len(previous_parts) > columns:
    # TODO: handle case where this line has all columns and part of the previous line
    pass
  previous_parts.extend(parts)
  lines[i-1] = sep.join(previous_parts)
  # remove the current line that was moved to the previous one; i is not
  # advanced, so a row split over more than two lines keeps being merged
  del lines[i]

for l in lines:
  print(l)

Sample output:

PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05

The following inputs produce the same output:

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>
    2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''
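
As a quick check that inputs like these normalize to the same rows, here is a minimal sketch of a simpler variant, not the loop above: it flattens every non-empty field and regroups the fields by the header's column count. It assumes that no field is legitimately empty and that the first line is a complete header; the function name regroup_rows is hypothetical.

```python
def regroup_rows(text, sep="<|>"):
    """Flatten all non-empty fields, then regroup them by the header width."""
    fields = [
        field.strip()
        for line in text.splitlines()
        for field in line.split(sep)
        if field.strip()
    ]
    # Assumption: the first line is a complete header, so it defines the width.
    width = len([f for f in text.splitlines()[0].split(sep) if f.strip()])
    return [sep.join(fields[i:i + width]) for i in range(0, len(fields), width)]

broken = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

for row in regroup_rows(broken):
    print(row)
```

Because it ignores where the line breaks fall, this variant cannot detect a row that is genuinely missing a value; the line-by-line loop is the better starting point if that case has to be reported.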

