Delete white spaces by pattern regex
I am trying to remove whitespace from some strings, as shown below:
import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>
 2021/02/03
CCCCC<|>S1<|>2021/03/05'''
regex = r'\s*<|>\s*\n+'
subst = "<|>"
# print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE).split('\n'))
for line in re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE).split('\n'):
    # print(line)
    if line != '':
        print(line.strip())
I get this output:
PERSON_ID<|>|>DEPT_ID<|>|>DATE_JOINED
AAAAA<|>|>S1<|>|>2021/01/03
BBBBBB<|>|>S2<|>|<|> 2021/02/03
CCCCC<|>|>S1<|>|>2021/03/05
I am looking for a solution that removes the whitespace before the date on the second-to-last line of the output.
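The problem is that you have \n+ after >\s*. The match therefore stops removing whitespace at the \n and leaves the spaces at the start of the next line. Change it to >\s+, because a newline counts as whitespace.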
import re
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>
 2021/02/03
CCCCC<|>S1<|>2021/03/05'''
# \s+ after the separator consumes the newline and the next line's indentation.
# The pipe is escaped (\|) so that <|> is matched literally, not as alternation.
regex = r'\s*<\|>\s+'
subst = "<|>"
# print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE).split('\n'))
for line in re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE).split('\n'):
    # print(line)
    if line != '':
        print(line.strip())
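To see the effect of the trailing part of the pattern in isolation, the two endings can be compared on the broken row alone. This is only a minimal sketch: the names row, stops_at_newline and eats_all_whitespace are made up for the illustration, and the pipe is escaped in both patterns so that the \n+ versus \s+ ending is the only difference.

```python
import re

# One broken row: the date has wrapped onto the next line with a leading space.
row = "BBBBBB<|>S2<|>\n 2021/02/03"

# Ending in \n+: the match must finish on a newline, so the space that
# follows the newline is left in place.
stops_at_newline = re.sub(r'\s*<\|>\s*\n+', '<|>', row)

# Ending in \s+: the newline and the indentation after it are both consumed.
eats_all_whitespace = re.sub(r'\s*<\|>\s+', '<|>', row)

print(repr(stops_at_newline))
print(repr(eats_all_whitespace))
```

The first result still has a space before the date, while the second joins the date directly onto the separator.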
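I thought I would try a solution that does not rely on regular expressions and instead tries to guarantee that every row consists of three columns separated by a common separator.
In particular, if you receive this dataset from an unknown source, it is not "regular", and no expression can guarantee that it is always the date column that needs fixing.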
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>
2021/02/03
CCCCC<|>S1<|>2021/03/05'''
lines = [l.strip() for l in test_str.splitlines(keepends=False)]
sep = '<|>'

def get_columns(line, separator):
    # split on the separator and drop empty fields (e.g. after a trailing separator)
    return [x.strip() for x in line.split(separator) if x.strip()]

header = get_columns(lines[0], sep)
columns = len(header)

# prepare case if second line needs to join first line
i = 1
lines[i] = sep.join(get_columns(lines[i], sep))

# check remaining lines
while True:
    i += 1
    if i >= len(lines):
        break
    # print(i, lines[i])
    parts = get_columns(lines[i], sep)
    previous_parts = get_columns(lines[i-1], sep)
    num_parts = len(parts)
    # skip lines that shouldn't be relocated
    if num_parts == columns or len(previous_parts) == columns:
        continue
    else:
        if len(parts) + len(previous_parts) > columns:
            # TODO: handle case where this line has all columns and part of the previous line
            pass
        previous_parts.extend(parts)
        lines[i-1] = sep.join(previous_parts)
        del lines[i]  # remove the current line that was moved to the previous
        i -= 1  # step back so the line that shifted into position i is checked too

for l in lines:
    print(l)
Sample output:
PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>2021/01/03
BBBBBB<|>S2<|>2021/02/03
CCCCC<|>S1<|>2021/03/05
The same output is produced for
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>S1<|>
2021/01/03
BBBBBB<|>S2<|>
2021/02/03
CCCCC<|>S1<|>2021/03/05'''
and
test_str = '''PERSON_ID<|>DEPT_ID<|>DATE_JOINED
AAAAA<|>
S1<|>2021/01/03
BBBBBB<|>S2<|>
2021/02/03
CCCCC<|>S1<|>2021/03/05'''
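One way to check a claim like this is to wrap the repair in a function and run it over all three inputs. The sketch below is not the loop shown earlier. It swaps in a simpler technique: collect every non-empty field in reading order, then regroup the fields by the header's column count. It therefore assumes that no field is ever legitimately empty. The names repair_rows, samples and expected are hypothetical.

```python
def repair_rows(text, separator='<|>'):
    """Regroup fields into rows as wide as the header (assumes no empty fields)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # The header row defines how many columns each row should have.
    width = len([f for f in lines[0].split(separator) if f.strip()])
    # Flatten every non-empty field, in reading order.
    fields = [f.strip() for line in lines
              for f in line.split(separator) if f.strip()]
    # Cut the flat list back into rows of `width` fields.
    return [separator.join(fields[i:i + width])
            for i in range(0, len(fields), width)]

header = "PERSON_ID<|>DEPT_ID<|>DATE_JOINED\n"
last = "CCCCC<|>S1<|>2021/03/05"
samples = [
    header + "AAAAA<|>S1<|>2021/01/03\nBBBBBB<|>S2<|>\n 2021/02/03\n" + last,
    header + "AAAAA<|>S1<|>\n 2021/01/03\nBBBBBB<|>S2<|>\n 2021/02/03\n" + last,
    header + "AAAAA<|>\n S1<|>2021/01/03\nBBBBBB<|>S2<|>\n 2021/02/03\n" + last,
]
expected = [
    "PERSON_ID<|>DEPT_ID<|>DATE_JOINED",
    "AAAAA<|>S1<|>2021/01/03",
    "BBBBBB<|>S2<|>2021/02/03",
    "CCCCC<|>S1<|>2021/03/05",
]

for sample in samples:
    print(repair_rows(sample) == expected)
```

The trade-off is that a genuinely missing value would shift every later field by one position, which the row-by-row loop does not do.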