Extracting a section of a string using regex with repeating ending words

I am attempting to extract some raw strings using the re module in Python. The end of a to-be-extracted section is identified by a word that occurs multiple times in the text, and my current attempts always capture up to the last occurrence of that word. How can I modify this behavior?

A text file has been extracted from a PDF. The entire PDF is stored as one string. The general format of the string is as below:

"***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"

The intended string to be captured is: "Collection of alphanumeric words and characters"

The attempted solution used in this situation was: re.compile(r"*{3}Start of notes:(.+)\sEndofsection")

This attempt tends to match the whole string rather than just "Collection of alphanumeric words and characters" as intended.

One possible approach is to split on Endofsection and then extract the string from the first section only - this works, but I was hoping to find a more elegant solution using re.compile.
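For reference, the split-based workaround described above might look roughly like the sketch below. The question does not show its own code, so the variable names and the way the "***Start of notes:" prefix is stripped are assumptions.

```python
# Sample string in the format described in the question.
s = ("***Start of notes: Collection of alphanumeric words and characters "
     "EndofsectionTopic A: string of words Endofsection")

# Keep only the text before the first "Endofsection" (maxsplit=1).
first_section = s.split("Endofsection", 1)[0]

# Assumed prefix handling: drop the marker and surrounding whitespace.
prefix = "***Start of notes:"
if first_section.startswith(prefix):
    notes = first_section[len(prefix):].strip()
else:
    notes = None

print(notes)
```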

There are two problems in your regex:

  • You need to escape * as \*, because it is a metacharacter.
  • You are using (.+), which is a greedy quantifier and matches as much as possible. Since you want the shortest match, change it to the lazy form (.+?).

Fixing these two issues gives you the intended match.
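To make the greedy versus lazy difference concrete, here is a small side-by-side sketch on the sample string from the question. Both patterns already escape the asterisk; only the quantifier differs. The variable names greedy and lazy are just for illustration.

```python
import re

s = ("***Start of notes: Collection of alphanumeric words and characters "
     "EndofsectionTopic A: string of words Endofsection")

# Greedy: (.+) grabs as much as it can, so the match runs to the LAST
# whitespace + "Endofsection" in the string.
greedy = re.search(r'\*{3}Start of notes:(.+)\sEndofsection', s)

# Lazy: (.+?) stops at the FIRST whitespace + "Endofsection".
lazy = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)

print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
```

The greedy capture still contains the first "Endofsection" and the "Topic A" text, while the lazy capture ends right before the first one.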


Python code:

import re

s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
# \*{3} matches the three literal asterisks; (.+?) lazily captures up to the first " Endofsection"
m = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)
if m:
    print(m.group(1))

Prints:

 Collection of alphanumeric words and characters
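Since the question asked for a re.compile-based solution, the same pattern can also be precompiled and reused. The following is only a sketch: the names NOTES_RE and extract_notes are hypothetical, the \s* after the colon is an addition that drops the leading space seen in the output above, and re.DOTALL is an assumption for the case where the PDF text contains line breaks inside a section.

```python
import re

# Hypothetical name. \s* (added here) skips whitespace after the colon;
# re.DOTALL (an assumption) lets "." match newlines in the extracted PDF text.
NOTES_RE = re.compile(r'\*{3}Start of notes:\s*(.+?)\sEndofsection', re.DOTALL)

def extract_notes(text):
    """Return the notes section before the first Endofsection, or None."""
    m = NOTES_RE.search(text)
    return m.group(1) if m else None

s = ("***Start of notes: Collection of alphanumeric words and characters "
     "EndofsectionTopic A: string of words Endofsection")
print(extract_notes(s))
```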

