Delete white spaces by pattern regex

I am trying to remove the whitespace from some strings, as below:

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

regex = r'\s*<|>\s*\n+'
subst = "<|>"
# print(re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'))
for line in re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'):
    # print(line)
    if line != '':
        print (line.strip())

This is the output I get:

PERSON_ID<|>|>DEPT_ID<|>|>DATE_JOINED
AAAAA<|>|>S1<|>|>2021/01/03
BBBBBB<|>|>S2<|>|<|>    2021/02/03
CCCCC<|>|>S1<|>|>2021/03/05

I am looking for a way to remove the whitespace before the date on the second-to-last line of the output.

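The problem is that you have \n+ after >\s* . So the match stops at the \n and leaves the spaces at the beginning of the next line. Change that to >\s+ , since newline is included in whitespace.

Also note that | is the alternation operator in a regex, so it has to be written as \| to match the literal <|> separator. Unescaped, the pattern is read as two alternatives, \s*< and >\s*\n+ , which is where the stray |> in your output comes from.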

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

# "|" means alternation in a regex, so it is escaped here to match the literal "<|>".
# \s+ also consumes the newline and the leading spaces of the next line.
regex = r'\s*<\|>\s+'
subst = "<|>"
for line in re.sub(regex, subst, test_str).split('\n'):
    if line != '':
        print(line.strip())
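
For what it's worth, here is a minimal sketch of the same idea that builds the pattern with re.escape, so the separator is always matched literally. The helper name normalize and the SEP constant are my own and not from the question, and the sketch assumes the separator never appears inside a field.

```python
import re

SEP = "<|>"  # assumption: this exact text never occurs inside a field

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

# Unescaped, "|" splits the pattern into two alternatives: \s*< and >\s*\n+
unescaped_hits = re.findall(r'\s*<|>\s*\n+', 'A<|>B')   # ['<']
# Escaped, the whole separator is matched as one literal token.
escaped_hits = re.findall(r'\s*<\|>\s*', 'A<|>B')       # ['<|>']

def normalize(text, sep=SEP):
    """Collapse whitespace (including newlines) around each separator."""
    pattern = r'\s*' + re.escape(sep) + r'\s*'
    joined = re.sub(pattern, sep, text)
    # Drop blank lines and the indentation at the start of each row.
    return [line.strip() for line in joined.split('\n') if line.strip()]

for line in normalize(test_str):
    print(line)
```

Keep in mind that this joins any line ending in the separator to the line after it, so it only works if a trailing separator always means the row continues on the next line.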

I figured I'd try a solution that doesn't depend on regex and instead tries to guarantee that each row ends up with three columns separated by a common separator.

Specifically, if you're getting this dataset from an unknown source, then it isn't "regular", and no expression can guarantee that it is always the date column that needs fixing.

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

lines = [l.strip() for l in test_str.splitlines(keepends=False)]
sep = '<|>'

def get_columns(line, separator):
  return [x.strip() for x in line.split(separator) if x.strip()]

header = get_columns(lines[0], sep)
columns = len(header)

# prepare case if second line needs to join first line
i = 1
lines[i] = sep.join(get_columns(lines[i], sep))

# check remaining lines
while True:
  i += 1
  if i >= len(lines):
    break
  # print(i, lines[i])
  parts = get_columns(lines[i], sep)
  previous_parts = get_columns(lines[i-1], sep)
  num_parts = len(parts)
  # skip lines that shouldn't be relocated
  if num_parts == columns or len(previous_parts) == columns:
    continue
  else:
    if len(parts) + len(previous_parts) > columns:
      # TODO: handle case where this line has all columns and part of the previous line
      pass  
    previous_parts.extend(parts)
    lines[i-1] = sep.join(previous_parts)
    del lines[i]  # remove the current line that was moved to the previous
    i -= 1  # stay on the merged line so a row split over 3+ lines keeps merging

for l in lines:
  print(l)

Sample output

PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05

Same output for

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>
    2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

and

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''
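
If rows can be split at arbitrary points, as in these three inputs, a more compact way to get the same result is to ignore line boundaries entirely: treat every newline as a separator, collect the non-empty fields, and regroup them by the header width. This is only a sketch and not the code above. The name rebuild_rows is made up, and it assumes the first line is a complete header and that no field is legitimately empty or contains a newline.

```python
def rebuild_rows(text, sep="<|>"):
    """Regroup fields into rows as wide as the header (hypothetical helper)."""
    # Width comes from the first line, assumed to be a complete header.
    width = len([f for f in text.splitlines()[0].split(sep) if f.strip()])
    # Treat newlines as separators, then drop the blanks left by "<|>\n".
    # ASSUMPTION: no real field is empty, so blanks can be discarded safely.
    fields = [f.strip() for f in text.replace('\n', sep).split(sep)]
    fields = [f for f in fields if f]
    return [sep.join(fields[i:i + width]) for i in range(0, len(fields), width)]

samples = [
    '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05''',
    '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>
    2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05''',
    '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05''',
]

for sample in samples:
    for row in rebuild_rows(sample):
        print(row)
    print()
```

The trade-off is that a genuinely empty column would shift every later field by one, so this is only safe when every column is always populated.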
