How can I match a reStructuredText code block with Regex and Python?
I am trying to extract a code block from a .rst document using Python and regex. The code blocks in the document are defined by adding a .. code-block:: python directive to the text and then indenting by a few spaces.
Here is an example from my test document:
.. code-block:: python

    import os
    from selenium import webdriver
    from axe_selenium_python import Axe

    def test_google():
        driver = webdriver.Firefox()
        driver.get("http://www.google.com")
        axe = Axe(driver)
        # Inject axe-core javascript into page.
        axe.inject()
        # Run axe accessibility checks.
        results = axe.execute()
        # Write results to file
        axe.write_results(results, 'a11y.json')
        driver.close()
        # Assert no violations are found
        assert len(results["violations"]) == 0, axe.report(results["violations"])
        driver.close()
So far I have this regex:
(\\.\\. code-block:: python\\s\\s)(.*\\s.+).*?\\n\\s+(.*\\s.+)+
The problem with this pattern is that it selects only the first part and the last part of the test string. I need help writing a pattern that captures everything within the .. code-block:: python code block, excluding the .. code-block:: python directive itself.
You can see the progress I have made with this here.
If you insist on using regex, the following should do the trick for the provided example:
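import re

# `text` is assumed to hold the contents of the .rst document.
pattern = r"(\.\. code-block:: python\s+$)((\n +.*|\s)+)"
matches = re.finditer(pattern, text, re.M)
for m, match in enumerate(matches):
    # Group numbering starts at 1, so enumerate from 1 to agree with match.group(n).
    for g, group_text in enumerate(match.groups(), start=1):
        print("###match {}, group {}:###".format(m, g))
        print(group_text, end="")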
The trick, I believe, is to use nested parentheses and the re.MULTILINE (re.M) flag.
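To see why the flag matters, here is a minimal sketch of how `$` behaves with and without re.M. The `sample` string is a hypothetical miniature document made up for illustration, and the shortened pattern is only a fragment of the one above.

```python
import re

# Hypothetical miniature document used only for this illustration.
sample = ".. code-block:: python\n\n    import os\n"

# Without re.M, `$` matches only at the end of the whole string
# (or just before a final newline), so the search fails here.
without_flag = re.search(r"python\s+$", sample)

# With re.M, `$` also matches at the end of every line, so the
# directive line can be matched on its own.
with_flag = re.search(r"python\s+$", sample, re.M)

print(without_flag)
print(repr(with_flag.group(0)))
```

The first search should find nothing, while the second stops right after the directive line, which is what lets the second group start at the indented block.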
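The resulting match object(s) will have 3 groups, as defined by the parentheses: group 1 is the directive line, group 2 is the entire indented block that follows it, and group 3 is only the last repetition of the inner group.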
To retrieve group n, use match.group(n). Note that indexing of groups starts at 1, and passing 0 or no arguments will result in the entire matching string.
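As a rough end-to-end sketch, the snippet below applies the same pattern to a small document and pulls out only the code. The `text` value is a hypothetical sample, not the asker's real file, and the use of textwrap.dedent to strip the common indentation is an addition that the answer itself does not mention.

```python
import re
import textwrap

# Hypothetical sample document, made up for this illustration.
text = (
    "Intro paragraph.\n"
    "\n"
    ".. code-block:: python\n"
    "\n"
    "    import os\n"
    "\n"
    "    def hello():\n"
    "        return os.name\n"
    "\n"
    "Closing paragraph.\n"
)

pattern = r"(\.\. code-block:: python\s+$)((\n +.*|\s)+)"

match = re.search(pattern, text, re.M)

# Group 1 is the directive line, group 2 the indented block after it.
directive = match.group(1)

# Remove the shared leading indentation and the surrounding blank lines.
code = textwrap.dedent(match.group(2)).strip("\n")

print(code)
```

Here match.group(0) and match.group() both return the directive together with the block, so group 2 is the one to use when the directive should be excluded.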