Regex to remove section numbers from text

Question

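I am trying to clean up text for use in a machine learning application. Basically, these are "semi-structured" specification documents, and I am trying to remove the section numbers that confuse NLTK's sent_tokenize() function.

Here is a sample of the text I am working with: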

and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3

...

(b)

until thirty-five days after the time fixed for receiving this tender,

whichever first occurs.
2.4

AGREEMENT

Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.

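I am trying to remove all of the section markers (e.g. 2.3.3, 2.4, (b)) without removing the numbers in dates.

This is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.

Unfortunately, it matches part of the date in the last paragraph (2019. becomes 201), and I really don't know how to fix this.

Thanks for your help!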

Answer 1

You can try replacing the following pattern with an empty string:

((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))

import re
# text is the string holding the document to clean
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)

This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears at the start of a line. It also matches letter section headers as \([a-z]+\).
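As a quick check, the pattern can be run against a shortened version of the sample from the question. This is only a sketch: the `sample` string and the variable names are illustrative assumptions, not the asker's actual data.

```python
import re

# Shortened, hypothetical excerpt of the sample text from the question.
sample = (
    "the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

# Pattern from this answer: a marker is matched only right at the start of
# the string or right after a newline.
pattern = r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))'

cleaned = re.sub(pattern, '', sample)
print(cleaned)
```

The markers are deleted but their line breaks stay, so the emptied lines may still need to be collapsed before tokenizing. Because the dotted part is optional (`*`), a line that starts with a plain number would lose that number as well.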

Answer 2

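The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored. It matches 0+ digits, a dot and a digit, or (|) a single digit followed by a dot.

It also does not account for the markers between parentheses.

Assuming the section markers are at the start of the string, possibly preceded by spaces or tabs, you could update your pattern with an alternation to: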
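To see why the unanchored pattern damages the date, it helps to look at what it actually matches on the last line of the sample. A small sketch (the variable names are illustrative):

```python
import re

original = r"[0-9]*\.[0-9]|[0-9]\."
line = "complete the said work on or before October 15, 2019."

# The second alternative, [0-9]\., matches the "9." that ends the sentence.
matches = re.findall(original, line)
damaged = re.sub(original, "", line)

print(matches)   # ['9.']
print(damaged)   # ... October 15, 201
```

Nothing in the pattern ties a match to the beginning of a line, so any digit followed by a full stop qualifies.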

^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))

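^ Start of the string (or of each line when re.MULTILINE is set)
[\t ]* Match 0+ spaces or tabs
(?: Non-capturing group
- \d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a dot and 1+ digits, so that at least one dot is required, as in 2.3.3 or 2.4
- | Or
- \([a-z]+\) Match 1+ characters a-z between parentheses
) Close the non-capturing group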
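The `+` after the dotted group is the part that matters most here: it requires at least one dot, so a bare number at the start of a line is treated as ordinary text. The sketch below contrasts it with a `*` variant on a few made-up lines (the `lines` string and the names `with_plus` and `with_star` are illustrative).

```python
import re

with_plus = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)
with_star = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)*|\([a-z]+\))", re.MULTILINE)

lines = "2.4\n\t(b)\n35 days after the time fixed for receiving this tender"

plus_result = with_plus.sub("", lines)
star_result = with_star.sub("", lines)

print(repr(plus_result))  # the leading "35" is kept
print(repr(star_result))  # the leading "35" is removed
```

The trade-off is that a top-level section number with no dot, such as a lone 2 on its own line, is not removed by the `+` version.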


For example, using re.MULTILINE, where s is your string:

import re

pattern = r"^(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
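For reuse across many documents, the substitution could be wrapped in a small helper. This is a sketch under assumptions: `strip_section_markers` and `SECTION_MARKER` are hypothetical names, the leading `[\t ]*` is the optional-whitespace part of the full pattern given earlier, and the trailing `[\t ]*\n?` is an addition that is not part of this answer, so that a marker alone on a line takes its line break with it.

```python
import re

# Hypothetical helper. The trailing [\t ]*\n? is an assumption added here,
# not part of the answer's pattern: it removes the rest of a marker-only line.
SECTION_MARKER = re.compile(
    r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))[\t ]*\n?",
    flags=re.MULTILINE,
)


def strip_section_markers(text: str) -> str:
    """Remove markers such as 2.3.3, 2.4 or (b) found at the start of a line."""
    return SECTION_MARKER.sub("", text)


sample = (
    "forfeited to the Crown.\n"
    "2.3.3\n"
    "  (b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "2.4 AGREEMENT\n"
    "on or before October 15, 2019.\n"
)
cleaned = strip_section_markers(sample)
print(cleaned)
```

A line that legitimately starts with a decimal number (for example a measurement) would also be stripped, so it may be worth spot-checking the output on real documents.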

Answer 3

For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to distinguish the section markers.
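A few properties of this pattern are worth checking before relying on it. The sketch below uses short made-up strings (not the asker's data) to show that the newline is removed together with the marker, that a marker on the very first line is missed, and that `[\d+\.]` is a character class that also matches plain numbers.

```python
import re

pattern = r"\n[\d+\.]+|\n\(\w\)"

# The newline is part of each match, so it is removed along with the marker.
merged = re.sub(pattern, "", "Crown.\n2.3.3\n(b)\nuntil")

# A marker on the very first line has no preceding newline and is kept.
first_line = re.sub(pattern, "", "2.4\nAGREEMENT")

# [\d+\.] is a character class (digits, "+", "."), so a line that begins
# with an ordinary number loses that number and its line break too.
plain_number = re.sub(pattern, "", "tender,\n35 days")

print(repr(merged))        # 'Crown.\nuntil'
print(repr(first_line))    # '2.4\nAGREEMENT'
print(repr(plain_number))  # 'tender, days'
```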

Answer 1
Score: 3 (accepted), 2019-07-13 15:00:19

Answer 2
Score: 1, 2019-07-13 15:45:27

Answer 3
Score: 0, 2019-07-13 14:54:23
