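How can I capture all lines while skipping lines that contain specific words or patterns? Can a negative lookahead be used for the capture?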

Question

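The data comes from a table in a PDF, extracted by converting it to text with pdftotext. I am trying to capture every line from district1 to Total, excluding the lines that contain Call or Office.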

Regex I tried

I also applied the DOTALL flag. I tried to capture it in Python like this: re.findall(r'District.*icts(.*Total.*?\n|\r)',input,re.DOTALL)

District.*icts(.*Total.*?\n|\r)

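The above captures (not just matches) everything between district1 and Total, inclusive. But I want to drop, or never capture, the lines that contain the keywords Call or Office. So I tried a negative lookahead, but that did not work either: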

District.*icts(((?!Call|Office|^\n).)*Total.*?\n|\r)

I have been stuck on this all day. I have no other ideas for ignoring those lines and capturing the rest. Any help would be appreciated.

Possible variations of the input

---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6
district3                           7                         -


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
  district4                           131
        Total                       263
---dont capture this line----

---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
district3                           7                         -

  district4                           131
        Total                       263
---dont capture this line----

---dont capture this line----
            District    No. of positive cases admitted        Other Districts
             Call Centre:12323, 132123
district1                           7                        1 district4
district2                           6



                                   Office:212332122 , 1056
district3                           7                         -

  district4                           131
        Total                       263
---dont capture this line----

Need to capture

district1                           7                        1 district4
district2                           6
district3                           7                         -
  district4                           131
        Total                       263

Answer 1

The simplest approach may be not to use a regex at all. Something like this should work well:

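KEY_WORDS = ("district", "Total")  # prefixes of the lines to keep


def filter_pdf(doc):
    """Return only the lines whose stripped text starts with a keyword."""
    buffer = ''
    for line in doc.split("\n"):
        temp_line = line.strip()  # Remove leading and trailing whitespace
        # str.startswith accepts a tuple and tries each prefix in turn
        if temp_line.startswith(KEY_WORDS):
            buffer += line + "\n"
    return buffer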

This gives the output you want:

>>> doc = """
---dont capture this line----
            District    No. of positive cases admitted        Other Districts
district1                           7                        1 district4
district2                           6
district3                           7                         -


             Call Centre:12323, 132123
                                   Office:212332122 , 1056
  district4                           131
        Total                       263
---dont capture this line----
"""
>>> cleaned = filter_pdf(doc)
>>> print(cleaned)
district1                           7                        1 district4
district2                           6
district3                           7                         -
  district4                           131
        Total                       263
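
On the lookahead part of the question: a single regex match is always one contiguous stretch of text, so one capture group cannot skip lines in the middle of it. The tempered pattern ((?!Call|Office|^\n).)* simply stops at the first position where Call or Office begins, which is why the attempt fails. A negative lookahead does work if it is applied per line instead. The following is only a sketch, assuming the text uses \n line endings, the header line ends with "Districts", and the table ends at a line starting with "Total"; the names BLOCK_RE, LINE_RE, and capture_rows are made up for illustration.

```python
import re

# Step 1: grab everything after the header line (ending in "Districts")
# up to and including the line that starts with "Total".
BLOCK_RE = re.compile(
    r'Districts[ \t]*\n(.*?^[ \t]*Total[^\n]*)',
    re.DOTALL | re.MULTILINE,
)

# Step 2: match a whole line only if "Call" or "Office" appears nowhere
# on it and it has at least one non-whitespace character.
# No DOTALL here, so "." (and the lookahead) stay within one line.
LINE_RE = re.compile(r'^(?!.*(?:Call|Office)).*\S.*$', re.MULTILINE)


def capture_rows(text):
    """Return the table rows of every block, minus Call/Office and blank lines."""
    rows = []
    for block in BLOCK_RE.findall(text):
        rows.extend(LINE_RE.findall(block))
    return rows


sample = """---dont capture this line----
            District    No. of positive cases admitted        Other Districts
             Call Centre:12323, 132123
district1                           7                        1 district4
district2                           6


                                   Office:212332122 , 1056
district3                           7                         -

  district4                           131
        Total                       263
---dont capture this line----
"""

for row in capture_rows(sample):
    print(row)
```

Unlike the prefix check above, this keeps any non-blank line inside the block that does not mention Call or Office, so it does not depend on the rows starting with "district". The rows keep their original leading whitespace.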


Solution 1
0 Accepted 2020-04-07 17:30:38
