How to use regex substitution with non-capturing groups in Python 3?

Question

import pdfplumber, requests, re, io
def pdf_extracted_txt(url, page):
    rq = requests.get(url)
    pdf = pdfplumber.load(io.BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    return txt

def remove_noise(txt):
    pattern = r'^.{1,3}$|(^.{1,3})(?:\s[A-Z])|\s+.{1,2}$'
    noiseRegx = re.compile(pattern, flags=re.MULTILINE)
    txt = noiseRegx.sub(r'',txt)
    txt = re.sub('\n+','\n',txt)
    print(txt)

url, page = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0428/2020042800976.pdf', 60
txt = pdf_extracted_txt(url, page)
remove_noise(txt)

I want to remove the noise from the extracted text so that the last three lines read:

PricewaterhouseCoopers
Certified Public Accountants
Hong Kong, 26 March 2020

However, the code also replaces the text matched by \s[A-Z], and the non-capturing group seems to have no effect. The current output is:

ricewaterhouseCoopers
ertified Public Accountants
ong Kong, 26 March 2020

Here are the regex and the text. Any suggestions would be appreciated.

Answer 1

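The second part of the pattern, (^.{1,3})(?:\s[A-Z]), captures any character (including spaces) 1 to 3 times in a group, and then matches a whitespace character and an uppercase character. A non-capturing group only avoids creating a numbered group; the text it matches is still part of the overall match.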

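If you replace the whole match with an empty string, the whitespace and the uppercase character are removed as well, which is why they are missing from the output.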
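A minimal sketch of this behaviour, using only the second alternative of the pattern. The sample string is a hypothetical stand-in for one line of the extracted PDF text (the real text is not shown in the question), assuming a short noise prefix followed by a space:

```python
import re

# Hypothetical sample line: a short noise prefix, a space, then the real text.
sample = "ab PricewaterhouseCoopers"

# Second alternative of the original pattern, on its own.
pattern = re.compile(r'(^.{1,3})(?:\s[A-Z])', flags=re.MULTILINE)
m = pattern.search(sample)

whole_match = m.group(0)  # everything sub() will replace, including " P"
captured = m.group(1)     # only the 1-3 leading characters
group_count = len(m.groups())  # (?:...) does not add a numbered group

# sub() replaces the whole match, not just group 1.
result = pattern.sub('', sample)

print(repr(whole_match), repr(captured), group_count, repr(result))
```

The non-capturing group changes how groups are numbered, not what the match consumes, so the space and the "P" are deleted along with the prefix.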

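Depending on what you want to allow to match, you could, for example, match 1 to 3 non-whitespace characters followed by a whitespace character, and use a positive lookahead to assert that an uppercase character [A-Z] is directly to the right: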

^.{1,3}$|^\S{1,3}\s(?=[A-Z])|\s+.{1,2}$
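As a rough check, the lookahead version can be compared with the original pattern on a small sample. The sample text below is hypothetical (an assumption about what the extracted lines might look like, with short noise prefixes), and this `remove_noise` returns the cleaned text instead of printing it:

```python
import re

# Original pattern from the question and the lookahead version from the answer.
ORIGINAL = re.compile(r'^.{1,3}$|(^.{1,3})(?:\s[A-Z])|\s+.{1,2}$', flags=re.MULTILINE)
FIXED = re.compile(r'^.{1,3}$|^\S{1,3}\s(?=[A-Z])|\s+.{1,2}$', flags=re.MULTILINE)

def remove_noise(txt, noise_regex=FIXED):
    """Strip noise matches, then collapse runs of newlines."""
    cleaned = noise_regex.sub('', txt)
    return re.sub(r'\n+', '\n', cleaned)

# Hypothetical sample: each line starts with a short noise prefix.
sample = (
    "ab PricewaterhouseCoopers\n"
    "xy Certified Public Accountants\n"
    "z Hong Kong, 26 March 2020"
)

fixed_lines = remove_noise(sample).splitlines()
original_lines = remove_noise(sample, ORIGINAL).splitlines()

print(fixed_lines)     # uppercase letters kept
print(original_lines)  # first letter of each line lost
```

The lookahead (?=[A-Z]) only checks that an uppercase letter follows; it consumes nothing, so the letter is not part of the match and survives the substitution.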


You could also keep the dot instead of \S, but then it would also match, for example:

Hong Kong, 26 March 2020

Solution 1
0 Accepted 2020-08-10 12:09:40
