RegEx for capturing part of a string
I am trying to grab top-level Markdown headings (i.e., headings beginning with a single hash, such as # Introduction) in an .md doc with Python's re library and cannot for the life of me figure this out.
Here is the code I'm trying to execute:
import re
pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text)
print(header.string)
The result from print(header.string) is:
# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n
whereas I only want # Title\n
This example on regex101 says it should work, but I can't figure out why it doesn't: https://regex101.com/r/u4ZIE0/9
You get that result because you use header.string, which reads the .string attribute of the Match object. That attribute gives you back the whole string passed to match() or search(), not the part that matched.
Because text is a raw string, it already contains literal \n sequences (a backslash followed by the letter n) rather than real newline characters:
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
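To see what the r prefix does to the data, here is a small sketch comparing a raw and a non-raw literal. The names raw_text and plain_text are made up for this illustration and shortened from the question's string.

```python
# Hypothetical names for illustration; the strings are shortened from the question.
raw_text = r"# Title\n## Chapter\n"
plain_text = "# Title\n## Chapter\n"

# In the raw literal, \n stays as two characters: a backslash and the letter n.
has_real_newline_raw = "\n" in raw_text
has_literal_backslash_n = "\\n" in raw_text

# In the non-raw literal, \n is a single newline character.
has_real_newline_plain = "\n" in plain_text

print(has_real_newline_raw, has_literal_backslash_n, has_real_newline_plain)
print(len(raw_text) - len(plain_text))  # each \n costs one extra character in the raw form
```

This is why the pattern in the question needs \\n (a literal backslash and n) to match anything in the raw string.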
So if you keep your pattern (note that it also matches the trailing \n), you could update your code to call header.group():
import re
pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text)
print(header.group())
Note that re.search looks for the first location where the regex produces a match.
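As a rough side-by-side sketch of the difference, using the pattern and text from the question (the variable names whole_input, matched_part and span are only for illustration):

```python
import re

pattern = r"(# .+?\\n)"
text = r"# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"

match = re.search(pattern, text)

whole_input = match.string    # the full string that was searched
matched_part = match.group()  # only the text the pattern matched
span = match.span()           # (start, end) offsets of the match within text

print(whole_input)
print(matched_part)
print(span)
```

Since re.search stops at the first match, only the first heading comes back here, even though later parts of the string would also satisfy the pattern.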
Another option is to match, from the start of a line, a # followed by a space and then any character except a newline until the end of that line. With the re.M flag, ^ and $ anchor at every line instead of only at the start and end of the whole string, so this assumes the text contains real newline characters:
^# .*$
For example:
import re
pattern = r"^# .*$"
text = "# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n"
header = re.search(pattern, text, re.M)
print(header.group())
If there cannot be any more # characters after that, you might also use a negated character class that matches any character except a # or a newline:
^# [^#\n\r]+$
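A minimal sketch of that last pattern, assuming the goal is to collect every top-level heading rather than only the first. The use of re.findall and the extra "# Summary" line are additions for illustration and are not part of the original question.

```python
import re

pattern = r"^# [^#\n\r]+$"
# Non-raw string, so \n is a real newline; "# Summary" is a made-up extra heading.
text = "# Title\n## Chapter\n### sub-chapter#### What a lovely day.\n# Summary\n"

# re.M lets ^ and $ anchor at each line; findall returns every match as a list.
top_level = re.findall(pattern, text, re.M)
print(top_level)
```

One trade-off of the negated class: a top-level heading that itself contains a # (for example "# Issue #12") would be skipped, so ^# .*$ may be the safer choice for such documents.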
I'm guessing that we wish to extract # Title\n, in which case your expression seems to work fine with a slight modification:
(# .+?\\n)(.+)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(# .+?\\n)(.+)"
test_str = "# Title\\n## Chapter\\n### sub-chapter#### The Bar\\nIt was a fall day.\\n"
subst = "\\1"
# You can cap the number of replacements with the count argument
result = re.sub(regex, subst, test_str, count=1)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.