
RegEx for capturing part of a string

I am trying to grab top-level Markdown headings (i.e., headings beginning with a single hash, such as # Introduction) in an .md document with Python's re library, and I cannot for the life of me figure this out.

Here is the code I'm trying to execute:

import re

pattern = r"(# .+?\\n)"

text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"

header = re.search(pattern, text)
print(header.string)

The result from print(header.string) is:

# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n

whereas I only want # Title\n

This example on regex101 says it should work, but I can't figure out why it doesn't: https://regex101.com/r/u4ZIE0/9

You get that result because you print header.string. The .string attribute of a Match object returns the whole string that was passed to match() or search(), not the part that matched.

Note that text is a raw string literal, so each \n in it is a literal backslash followed by n rather than a real newline character:

text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"

So if you keep your pattern (note that the match will also include the trailing \n), you could update your code to use header.group() instead:

import re

pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text)
print(header.group())

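To make the difference concrete, here is a minimal sketch comparing Match.string with Match.group() on a shortened version of the same data. The names raw_text, whole_input and matched_part are illustrative only and do not come from the question.

```python
import re

# Raw string literal: each "\n" stays as two characters, a backslash and an "n".
raw_text = r"# Title\n## Chapter\n"

# In the pattern, \\n matches that literal backslash followed by "n".
match = re.search(r"(# .+?\\n)", raw_text)

whole_input = match.string    # the full string handed to re.search
matched_part = match.group()  # only the portion the pattern matched

print(whole_input)       # # Title\n## Chapter\n
print(matched_part)      # # Title\n
print("\n" in raw_text)  # False: no real newline characters are present
```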

Note that re.search looks for the first location where the regex produces a match and returns only that match.

Another option is to anchor the match to the start of a line: match a # followed by a space, then any characters except a newline, up to the end of that line. This requires real newlines in the text (a non-raw string) and the re.M flag, so that ^ and $ match at line boundaries:

^# .*$

For example: 例如:

import re

pattern = r"^# .*$"
text = "# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text, re.M)
print(header.group())

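Since re.search returns only the first match, collecting every top-level heading needs a different call. A minimal sketch, assuming the text contains real newline characters (as it would when read from an .md file) and reusing the pattern above with re.findall; the sample text and variable names are made up for illustration.

```python
import re

# Sample Markdown with real newlines (a regular, non-raw string literal).
markdown_text = "# Title\n## Chapter\nBody text\n# Appendix\n### Notes\n"

# With re.M, ^ and $ match at the start and end of every line.
# "# " requires exactly one hash followed by a space at the line start,
# so "## Chapter" and "### Notes" are skipped.
top_level = re.findall(r"^# .*$", markdown_text, re.M)

# re.search with the same pattern stops at the first heading.
first_only = re.search(r"^# .*$", markdown_text, re.M).group()

print(top_level)   # ['# Title', '# Appendix']
print(first_only)  # # Title
```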

If no further # can follow on the heading line, you might also use a negated character class that matches any character except # or a newline:

^# [^#\n\r]+$
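As a rough illustration of how the negated character class behaves, the sketch below runs it with re.M over text that has real newlines. The extra "# Notes #1" line is a hypothetical addition, included only to show that a heading containing a second # is rejected.

```python
import re

text = "# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n# Notes #1\n"

# [^#\n\r]+ matches one or more characters that are not "#", "\n", or "\r",
# so a heading line containing a further "#" fails to reach the $ anchor.
pattern = r"^# [^#\n\r]+$"

headings = re.findall(pattern, text, re.M)
print(headings)  # ['# Title']
```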

I'm guessing that we wish to extract # Title\n, in which case your expression seems to work fine with a slight modification:

(# .+?\\n)(.+)


Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(# .+?\\n)(.+)"

test_str = "# Title\\n## Chapter\\n### sub-chapter#### The Bar\\nIt was a fall day.\\n"

subst = "\\1"

# Limit the number of replacements with the count argument (passed by keyword here)
result = re.sub(regex, subst, test_str, count=1)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
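The substitution above keeps group 1 and discards the rest. A minimal alternative sketch, assuming the same kind of input with literal \n sequences, reads the capture group directly instead of substituting; the name first_heading is illustrative.

```python
import re

# "\\n" in a regular string literal is a literal backslash followed by "n",
# not a newline character, matching the test string used above.
test_str = "# Title\\n## Chapter\\n### sub-chapter#### The Bar\\n"

match = re.search(r"(# .+?\\n)", test_str)

# Guard against None in case the text has no heading at all.
first_heading = match.group(1) if match else None

print(first_heading)  # # Title\n
```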
