
Delete white spaces by pattern regex

I am trying to remove whitespace from some strings, as shown below:

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

regex = r'\s*<|>\s*\<|>\s*\n+'
subst = "<|>"
# print(re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'))
for line in re.sub(regex, subst, test_str, 0, re.MULTILINE).split('\n'):
    # print(line)
    if line != '':
        print (line.strip())

I get this output:

PERSON_ID<|>|>DEPT_ID<|>|>DATE_JOINED
AAAAA<|>|>S1<|>|>2021/01/03
BBBBBB<|>|>S2<|>|<|>    2021/02/03
CCCCC<|>|>S1<|>|>2021/03/05

I am looking for a way to remove the whitespace before the date on the second-to-last line of the output.

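The problem is that you have `\n+` after `>\s*`, so the match stops at the `\n` and leaves the whitespace at the beginning of the next line in place. Change it to `>\s+`, because newlines are included in whitespace.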
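To see the difference in isolation, here is a minimal sketch on a shortened sample (the `sample` string is made up for illustration). It compares what the two alternatives actually match when a delimiter is followed by a line break and indentation.

```python
import re

# Shortened, made-up sample: a delimiter followed by a newline and indentation.
sample = 'S2<|>\n    2021/02/03'

# '>\s*\n+': \s* first grabs the newline and the spaces, then backtracks so
# that \n+ can match the newline. The indentation stays outside the match.
old_match = re.search(r'>\s*\n+', sample).group()

# '>\s+': \s also matches '\n', so the newline and the indentation
# are consumed together.
new_match = re.search(r'>\s+', sample).group()

print(repr(old_match))  # '>\n'
print(repr(new_match))  # '>\n    '
```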

import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

# The pipe is escaped so it matches literally (an unescaped | is alternation).
# \s+ also matches newlines, so the line break and the next line's
# indentation are removed together.
regex = r'<\|>\s+'
subst = "<|>"
for line in re.sub(regex, subst, test_str).split('\n'):
    if line != '':
        print(line.strip())
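One thing worth checking in patterns like these is the `|` inside the delimiter: unescaped, it is the alternation operator, so the regex is read as several alternatives rather than as the literal `<|>`. The sketch below is one way to avoid that by building the pattern with `re.escape`. It assumes the delimiter is always `<|>` and that no row legitimately ends with an empty field (otherwise the next row would be pulled up onto it). The helper name `clean_delimiters` is made up for illustration.

```python
import re

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

# re.escape makes the delimiter match literally, so '|' is not alternation.
# \s* on both sides swallows spaces and newlines around the delimiter.
delimiter = re.compile(r'\s*' + re.escape('<|>') + r'\s*')

def clean_delimiters(text):
    """Collapse whitespace (including newlines) around each '<|>'."""
    joined = delimiter.sub('<|>', text)
    # Drop blank lines and the indentation at the start of each row.
    return [line.strip() for line in joined.split('\n') if line.strip()]

cleaned = clean_delimiters(test_str)
for line in cleaned:
    print(line)
```

On the sample input this joins the stray date back onto the `BBBBBB` row and leaves the other rows unchanged.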

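I thought I'd try a solution that does not rely on regex and instead tries to guarantee that every line consists of three columns separated by a common delimiter.

Specifically, if you are getting this dataset from an unknown source, it is not "regular", and no expression can guarantee that the date column is always the one that needs fixing.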

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

lines = [l.strip() for l in test_str.splitlines(keepends=False)]
sep = '<|>'

def get_columns(line, separator):
  return [x.strip() for x in line.split(separator) if x.strip()]

header = get_columns(lines[0], sep)
columns = len(header)

# normalize the first data line (drops a trailing separator) so it can
# absorb the following line if it turns out to be incomplete
i = 1
lines[i] = sep.join(get_columns(lines[i], sep))

# check remaining lines
i = 2
while i < len(lines):
  parts = get_columns(lines[i], sep)
  previous_parts = get_columns(lines[i-1], sep)
  # skip lines that shouldn't be relocated
  if len(parts) == columns or len(previous_parts) == columns:
    i += 1
    continue
  if len(parts) + len(previous_parts) > columns:
    # TODO: handle case where this line has all columns and part of the previous line
    pass
  previous_parts.extend(parts)
  lines[i-1] = sep.join(previous_parts)
  # remove the line that was merged into the previous one; i is left
  # unchanged so the merged line is re-checked against the next line
  del lines[i]

for l in lines:
  print(l)

Sample output:

PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05

The same output is produced for the following inputs:

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>
    2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''
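A more compact variation on the same idea, offered only as a sketch: treat the data lines as one flat stream of fields and cut that stream into rows as wide as the header. The function name `regroup_rows` is hypothetical, and the sketch assumes that no field is legitimately empty and that fields never contain the separator or a line break.

```python
def regroup_rows(text, separator='<|>'):
    """Rebuild rows by chunking the flat stream of non-empty fields."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = [f.strip() for f in lines[0].split(separator) if f.strip()]
    width = len(header)
    if width == 0:
        raise ValueError('header line contains no columns')
    # Collect every non-empty field from the data lines, in order.
    fields = []
    for line in lines[1:]:
        fields.extend(f.strip() for f in line.split(separator) if f.strip())
    # Cut the stream into rows of `width` fields; a short final chunk
    # (incomplete last record) is kept as-is rather than dropped.
    rows = [fields[i:i + width] for i in range(0, len(fields), width)]
    return [separator.join(header)] + [separator.join(row) for row in rows]


sample_1 = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

sample_2 = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
    AAAAA<|>
    S1<|>2021/01/03
    BBBBBB<|>S2<|>
    2021/02/03
    CCCCC<|>S1<|>2021/03/05'''

for sample in (sample_1, sample_2):
    for row in regroup_rows(sample):
        print(row)
```

Because line breaks are ignored entirely, this version does not care where a record was split, but a single missing field would shift every later row, so it only suits data where each record is known to be complete.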

