
python regex - select words after pattern

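Text:

  3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.

Question:

The body of text comprises sections 3, 5, 25 and 38 (each followed by a starting index). From each section, I would like to extract all the text that comes after "- Comments:" and before the starting index of the next section.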

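import re

def comments(x):
    # NOTE: x is unused; the loop reads the DataFrame df from the enclosing scope.
    result = []
    for elem in df['Violations']:
        # Captures the text between "N. " and " - ", i.e. the heading before "- Comments:".
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)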

The code above does exactly the opposite: it extracts only the words before "- Comments:". How can I change it?

Many thanks

If you want the text between Comments: and |, then use those two markers in the regex.

'Comments: ([^\|]*) \|'

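It uses () to capture only the characters between Comments: and |. The class [^\|] matches any character except |, so the capture cannot run past the next |.

Because | has a special meaning in a regex (alternation), I use \| to treat it as an ordinary character in the text.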
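As a minimal sketch of why the backslash matters, the snippet below runs the pattern with and without the escape. The short text string is made up for illustration and is not the asker's data.

```python
import re

# Shortened, made-up sample in the same "heading - Comments: note |" layout.
text = "A - Comments: first note. | 5. B - Comments: second note. | 38."

# \| is a literal pipe. Inside [...] the pipe is already literal,
# so [^|] and [^\|] behave the same.
escaped = re.findall(r'Comments: ([^|]*) \|', text)
same = re.findall(r'Comments: ([^\|]*) \|', text)

# Without the backslash, | means "or": the second alternative is empty,
# so findall also returns an empty string at every position where the
# first alternative does not match.
unescaped = re.findall(r'Comments: ([^|]*) |', text)

print(escaped)          # ['first note.', 'second note.']
print(escaped == same)  # True
print('' in unescaped)  # True
```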


Or

'Comments: (.*?) \|'

which uses ? to make .* non-greedy, so the match stops at the first | instead of the last one.
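The effect of ? is easiest to see by comparing the lazy and greedy forms side by side. This is a small sketch on a made-up string, not the asker's data.

```python
import re

# Made-up sample with two comment blocks.
text = "Comments: first note. | 5. B - Comments: second note. | 38."

# Lazy: .*? stops at the first " |" after each "Comments: ".
lazy = re.findall(r'Comments: (.*?) \|', text)

# Greedy: .* runs to the last " |", swallowing everything in between.
greedy = re.findall(r'Comments: (.*) \|', text)

print(lazy)    # ['first note.', 'second note.']
print(greedy)  # ['first note. | 5. B - Comments: second note.']
```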


import re

elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''

# Raw strings keep \| from being treated as an invalid string escape.
#matches = re.findall(r'Comments: ([^\|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)

#print(matches)

for item in matches:
    print(item)
    print('---')

Result:

NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
---

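Your pattern captures as little text as possible in the group, up to the first " - ", a newline, or the end of the string, and it never matches the Comments: part.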
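To see what the pattern from the question actually returns, it can be run on a shortened string. The sample below is made up and only mimics the layout of the real text.

```python
import re

# Made-up sample in the same layout as the question's text.
sample = "3. HEADING ONE - Comments: NOTE ONE. | 5. HEADING TWO - Comments: NOTE TWO. | 38."

# Pattern from the question: the group ends at the first " - ",
# which sits right before "Comments:", so only headings are captured.
original = r'\d+\. (.*?)(?: - |\r?\n|$)'
headings = re.findall(original, sample)

print(headings)  # ['HEADING ONE', 'HEADING TWO']
```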

You can change it by matching Comments: as part of the pattern and adding a capture group for the text that follows it.

\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)
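A quick check of this pattern on a shortened, made-up string (the last section deliberately has no trailing pipe, to exercise the $ alternative):

```python
import re

# Made-up sample; the final section ends the string without " |".
sample = "3. HEADING ONE - Comments: NOTE ONE. | 5. HEADING TWO - Comments: NOTE TWO."

pattern = r'\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)'
notes = re.findall(pattern, sample)

print(notes)  # ['NOTE ONE.', 'NOTE TWO.']
```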


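A more precise match is to match the start of each item (digits, a dot and a space) and then match up to the first occurrence of - Comments: without crossing the start of another item.

After Comments, a capture group captures everything up to the next " |", or up to the end of the string if it is the last item.

Using re.findall will return the values of capture group 1.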

\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)

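The pattern matches:

  • \b A word boundary to prevent a partial word match
  • \d+\.  Match 1+ digits, a dot and a space
  • (?:(?!\d+\. |- Comments:).)* Match any character, as long as what is directly to the right is neither \d+\.  nor - Comments:
  • - Comments:\s* Match - Comments: followed by optional whitespace characters
  • (.*?) Capture group 1, match any character as few times as possible
  • (?: \||$) Match either a space and | or assert the end of the string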
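The difference between the two patterns only shows up when a section has no - Comments: part. The sketch below adds a capture group around the section number (an addition for illustration, not part of the patterns above) and uses a made-up string in which section 3 has no comment.

```python
import re

# Made-up sample: section 3 has no "- Comments:" part.
sample = "3. HEADING ONE | 5. HEADING TWO - Comments: NOTE TWO. | 38."

# Both patterns with an extra group around the digits (illustrative only).
simple = r'(\d+)\. .*?(?: - Comments:\s*)(.*?)(?: \||$)'
precise = r'\b(\d+)\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)'

# The simple pattern starts at "3. " and lets .*? run across "5. ",
# so the comment of section 5 is attributed to section 3.
simple_result = re.findall(simple, sample)

# The lookahead stops at "5. ", the attempt from "3. " fails,
# and the match starts at the section that really owns the comment.
precise_result = re.findall(precise, sample)

print(simple_result)   # [('3', 'NOTE TWO.')]
print(precise_result)  # [('5', 'NOTE TWO.')]
```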


Example

import re

regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.  | 38. "

print(re.findall(regex, s))

Output

[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.', 
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.', 
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. '
]
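The last item above keeps a trailing space because the sample has two spaces before the final |. One possible way to fold this back into the loop from the question is sketched below. The name extract_comments and the rows list are hypothetical; any iterable of strings, such as the asker's df['Violations'] column, should work the same way.

```python
import re

# Precise pattern from above, compiled once for reuse.
COMMENT_RE = re.compile(
    r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
)

def extract_comments(rows):
    """Return the stripped comment text found in every string of rows."""
    result = []
    for row in rows:
        # strip() removes stray whitespace left before the " |" separator.
        result.extend(m.strip() for m in COMMENT_RE.findall(row))
    return result

# Made-up rows standing in for the values of df['Violations'].
rows = [
    "3. HEADING ONE - Comments: NOTE ONE.  | 5. HEADING TWO - Comments: NOTE TWO. | 38. ",
    "12. HEADING THREE - Comments: NOTE THREE.",
]
extracted = extract_comments(rows)
print(extracted)  # ['NOTE ONE.', 'NOTE TWO.', 'NOTE THREE.']
```

Returning the list instead of printing it keeps the function usable from other code.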
