How to match reStructuredText code blocks with regex and Python?

Question

I am trying to extract a code block from a .rst document using Python and regex. The code blocks in the document are defined by adding a .. code-block:: python directive to the text and then indenting the code by a few spaces.

Here is an example from my test document:

.. code-block:: python

  import os
  from selenium import webdriver
  from axe_selenium_python import Axe

  def test_google():
      driver = webdriver.Firefox()
      driver.get("http://www.google.com")
      axe = Axe(driver)
      # Inject axe-core javascript into page.
      axe.inject()
      # Run axe accessibility checks.
      results = axe.execute()
      # Write results to file
      axe.write_results(results, 'a11y.json')
      driver.close()
      # Assert no violations are found
      assert len(results["violations"]) == 0,    axe.report(results["violations"])
      driver.close()

So far I have this regex: (\.\. code-block:: python\s\s)(.*\s.+).*?\n\s+(.*\s.+)+

The problem with this pattern is that it selects only the first part and the last part of the test string. I need help writing a pattern that captures everything within the .. code-block:: python code block, excluding the .. code-block:: python directive itself.

You can see the progress I have made with this here.

Answer 1

If you insist on using regex, the following should do the trick for the provided example:

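import re

# `text` is assumed to hold the contents of the .rst document as a string.
pattern = r"(\.\. code-block:: python\s+$)((\n +.*|\s)+)"

matches = re.finditer(pattern, text, re.M)

for m, match in enumerate(matches):
    # Number groups from 1 so the output lines up with match.group(n).
    for g, group_text in enumerate(match.groups(), start=1):
        print("###match {}, group {}:###".format(m, g))
        print(group_text, end="")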

The trick, I believe, is to use nested parentheses and the re.MULTILINE (re.M) flag.
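
To make the role of the flag concrete, here is a minimal sketch of how `$` behaves with and without re.M. The `sample` string and the `header` name are made up for this illustration and are not part of the question's document.

```python
import re

# Hypothetical short sample, not the question's full test document.
sample = ".. code-block:: python\n\n  import os\n"

# Only the header part of the answer's pattern.
header = r"\.\. code-block:: python\s+$"

# Without re.M, `$` matches only at the end of the string (or just before
# a final newline), so the header cannot end on the directive line.
without_flag = re.search(header, sample)

# With re.M, `$` also matches just before every newline in the string.
with_flag = re.search(header, sample, re.M)

print(without_flag)
print(repr(with_flag.group(0)))
```

Without the flag the search finds nothing, which is why the header group depends on re.M.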

The resulting match objects will have 3 groups, as defined by the parentheses:

group 1: the '.. code-block:: python' header
group 2: the contents of the code block
group 3: only the last repetition of the inner group, a by-product of the nested parentheses that can be ignored.

To retrieve group n, use match.group(n). Note that group indexing starts at 1; passing 0 or no argument returns the entire matched string.
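
If only the block contents are needed, one possible variation (a sketch, not the pattern from the answer above) is to group the repeated lines with a non-capturing `(?:...)` group so that a single capture remains, and to strip the indentation with `textwrap.dedent`. The `text` sample and the `extract_code_blocks` helper are hypothetical names introduced for this example, and the sketch assumes the directive has no options such as `:linenos:`.

```python
import re
import textwrap

# Shortened stand-in for the question's document (an assumption of this sketch).
text = """Some intro text.

.. code-block:: python

  import os

  def test_google():
      driver = None

Text after the block.
"""

# (?:...) groups without capturing, so group 1 is the block body only.
pattern = re.compile(
    r"^\.\. code-block:: python[ \t]*\n"   # the directive line itself
    r"((?:[ \t]*\n|[ \t]+.*\n?)+)",        # blank lines or indented lines
    re.M,
)


def extract_code_blocks(rst_text):
    """Return the dedented body of each python code-block directive."""
    blocks = []
    for match in pattern.finditer(rst_text):
        # Remove the common indentation and surrounding blank lines.
        body = textwrap.dedent(match.group(1)).strip("\n")
        blocks.append(body)
    return blocks


for block in extract_code_blocks(text):
    print(block)
```

The match stops at the first non-blank line that is not indented, which is how reStructuredText itself ends an indented block.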

Solution 1
0 Accepted 2018-10-31 14:00:09
