How to capture all lines but avoid lines containing a particular word or pattern? Is capturing with negative lookaheads possible?
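The data comes from a table in a PDF, extracted by converting it to text (using pdftotext). From the data below, I am trying to capture all lines from district1 to Total, except the lines containing Call or Office.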
The DOTALL flag is also applied. I am trying to capture it in Python like this: re.findall(r'District.*icts(.*Total.*?\n|\r)',input,re.DOTALL)
District.*icts(.*Total.*?\n|\r)
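The pattern above captures (not just matches) everything between district1 and Total, inclusive. But I want to drop, or never capture, the lines that contain the keywords Call or Office. So I tried applying a negative lookahead, but that did not work either: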
District.*icts(((?!Call|Office|^\n).)*Total.*?\n|\r)
I have been stuck on this all day and have no other ideas for ignoring those lines while capturing the rest. Any help would be greatly appreciated.
---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
district3 7 -
Call Centre:12323, 132123
Office:212332122 , 1056
district4 131
Total 263
---dont capture this line----
---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
Call Centre:12323, 132123
Office:212332122 , 1056
district3 7 -
district4 131
Total 263
---dont capture this line----
---dont capture this line----
District No. of positive cases admitted Other Districts
Call Centre:12323, 132123
district1 7 1 district4
district2 6
Office:212332122 , 1056
district3 7 -
district4 131
Total 263
---dont capture this line----
Expected output:
district1 7 1 district4
district2 6
district3 7 -
district4 131
Total 263
The simplest approach is probably not to use a regular expression at all. Something like this should work fine:
KEY_WORDS = ["district", "Total"]

def filter_pdf(doc):
    buffer = ''
    for line in doc.split("\n"):
        temp_line = line.strip()  # Remove leading and trailing whitespace
        for word in KEY_WORDS:
            # Keep only lines that start with one of the keywords (case-sensitive)
            if temp_line.startswith(word):
                buffer += line + "\n"
                break
    return buffer
This gives your expected output:
>>> doc = """
---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
district3 7 -
Call Centre:12323, 132123
Office:212332122 , 1056
district4 131
Total 263
---dont capture this line----
"""
>>> cleaned = filter_pdf(doc)
>>> print(cleaned)
district1 7 1 district4
district2 6
district3 7 -
district4 131
Total 263
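If you still want a regex with a negative lookahead, one possible sketch is to split the job in two: first isolate each table body (from the line after the District header through the Total line), then keep only the lines in that body that do not contain Call or Office. The names BLOCK, KEEP, capture_rows and SAMPLE below are made up for illustration, and the sketch assumes every table starts with a line beginning with "District" and ends with a line beginning with "Total", as in the sample data.

```python
import re

# Two of the sample tables from the question, used here as test input.
SAMPLE = """---dont capture this line----
District No. of positive cases admitted Other Districts
district1 7 1 district4
district2 6
district3 7 -
Call Centre:12323, 132123
Office:212332122 , 1056
district4 131
Total 263
---dont capture this line----
---dont capture this line----
District No. of positive cases admitted Other Districts
Call Centre:12323, 132123
district1 7 1 district4
district2 6
Office:212332122 , 1056
district3 7 -
district4 131
Total 263
---dont capture this line----
"""

# Step 1: capture the body of each table, from the line after the
# "District ..." header up to and including the "Total ..." line.
BLOCK = re.compile(r"^District.*?\n(.*?^Total.*?$)", re.DOTALL | re.MULTILINE)

# Step 2: match whole non-empty lines that do NOT contain "Call" or "Office".
# The negative lookahead is checked once at the start of each line;
# DOTALL is deliberately off here so "." stops at the end of the line.
KEEP = re.compile(r"^(?!.*(?:Call|Office)).+$", re.MULTILINE)


def capture_rows(text):
    """Return one list of kept lines per table found in text."""
    return [KEEP.findall(block) for block in BLOCK.findall(text)]


for rows in capture_rows(SAMPLE):
    print("\n".join(rows))
    print()
```

For the sample above, each table yields the same five rows: the four district lines followed by Total 263. A single findall capture group cannot skip lines in the middle of what it captures, because a group always holds one contiguous span of text, which is why the filtering is done as a second pass.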