Regular expression to extract the contents of a tag from HTML

I need a regular expression that extracts the contents of the outer pre tag, including any nested pre tags:

<div>
    <pre>
        Need to get this below block !!!

        Example of an env file:

        <pre>
        !![](./raw/IndisputableInsertCodeFromFile.env)
        </pre>

        After parsing this file, we will get a Python dictionary:

        ```
        !![](../out/IndisputableInsertCodeFromFile.py)
        ```

        Script code:

        ```
        !![](./raw/ENV_IndisputableInsertCodeFromFile.py)
        ```
    </pre>
</div>

Required result:

Need to get this below block !!!

Example of an env file:

<pre>
!![](./raw/IndisputableInsertCodeFromFile.env)
</pre>

After parsing this file, we will get a Python dictionary:

```
!![](../out/IndisputableInsertCodeFromFile.py)
```

Script code:

```
!![](./raw/ENV_IndisputableInsertCodeFromFile.py)
```   

I have tried these regular expressions, but neither returns the required result:

(?<=\<pre\>)(\s*.*\s*)(?=\<\/pre\>)
(?<=<pre>)(?P<body>\n*(?:.\s*(?!\/pre>))+\n*)
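The first attempt returns the inner block instead of the outer one. Without re.DOTALL, `.` does not match a newline, so `\s*.*\s*` can only cover one line of text plus the whitespace around it, and the only `<pre>` whose body has that shape is the nested one. A minimal sketch on a shortened sample (the `html` string is a made-up stand-in for the page above, not the real data):

```python
import re

# Shortened, hypothetical stand-in for the page in the question.
html = """<div>
    <pre>
        outer text
        <pre>
        inner text
        </pre>
        more outer text
    </pre>
</div>"""

# First attempted pattern, unchanged apart from dropping redundant escapes.
pattern = r"(?<=<pre>)(\s*.*\s*)(?=</pre>)"

# Without DOTALL, "." stops at newlines: only the one-line inner body fits.
without_dotall = re.search(pattern, html).group(1).strip()
print(without_dotall)

# With DOTALL, ".*" is greedy across lines and reaches the last </pre>.
with_dotall = re.search(pattern, html, flags=re.DOTALL).group(1).strip()
print(with_dotall)
```

With the flag added, the same pattern captures everything between the first `<pre>` and the last `</pre>`, nested block included.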

import re
from typing import Optional

text = """
<div>
    <pre>
        Need to get this below block !!!111

        Example of an env file:

        <pre>
        !![](./raw/IndisputableInsertCodeFromFile.env)
        </pre>

        After parsing this file, we will get a Python dictionary:

        ```
        !![](../out/IndisputableInsertCodeFromFile.py)
        ```

        Script code:

        ```
        !![](./raw/ENV_IndisputableInsertCodeFromFile.py)
        ```
    </pre>
        <pre>
        Need to get this below block !!!2222

        Example of an env file:

        <pre>
        !![](./raw/IndisputableInsertCodeFromFile.env)
        </pre>

        After parsing this file, we will get a Python dictionary:

        ```
        !![](../out/IndisputableInsertCodeFromFile.py)
        ```

        Script code:

        ```
        !![](./raw/ENV_IndisputableInsertCodeFromFile.py)
        ```
    </pre>
</div>
"""


def ParseTag(name_tag: str = 'pre'):
    str_start = f'<{name_tag}>'
    start_tag = list(str_start)
    len_start = len(start_tag)

    end_tag = list(f'</{name_tag}>')
    len_end = len(end_tag)

    # [(tag_text, start, stop)]
    res_list: list[tuple[str, int, int]] = []

    def _self(_text: str, last_start: int = 0):
        """Recursive function that searches for nested tags."""
        tmp: list[str] = []
        start_l: list[tuple[int, int]] = []
        end_l: list[tuple[int, int]] = []
        re_str: Optional[re.Match] = re.search(str_start, _text)
        if re_str:
            for i, symbl in enumerate(_text[re_str.start():]):
                # Look for opening tags
                if tmp[-len_start:] == start_tag:
                    start_l.append((i - len_start, i))
                # Look for closing tags
                elif tmp[-len_end:] == end_tag:
                    end_l.append((i - len_end, i))
                    # The whole nested tag has been traversed
                    if len(end_l) == len(start_l):
                        break
                tmp.append(symbl)
            end_symbols: int = re_str.start() + end_l[-1][-1]
            # Save the tag body
            res_list.append((''.join(tmp), re_str.start() + last_start, end_symbols + last_start))
            # Continue with the rest of the text; offsets accumulate so positions stay absolute
            return _self(_text[end_symbols:], last_start=last_start + end_symbols)

    # _self() fills res_list and itself returns None, so return the list
    _self(text)
    return res_list


if __name__ == '__main__':
    print(ParseTag('pre'))

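This finds the first match in the HTML page. Both patterns are greedy, so the match runs from the first <pre> to the last </pre>, which keeps the nested block inside the result.

re.search(r'<pre>([\s\S]+)</pre>', text).group(1)

re.search(r'(?<=<pre>).+(?=</pre>)', text, flags=re.DOTALL).group()

Both return the same text.

If the page has several matches, try re.findall with this pattern and select the one you need. Note that because the pattern is greedy, adjacent top-level <pre> blocks come back as a single combined match.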

output:

Need to get this below block !!!

Example of an env file:

<pre>
!![](./raw/IndisputableInsertCodeFromFile.env)
</pre>

After parsing this file, we will get a Python dictionary:

```
!![](../out/IndisputableInsertCodeFromFile.py)
```

Script code:

```
!![](./raw/ENV_IndisputableInsertCodeFromFile.py)
```
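Because both patterns are greedy, a page with several top-level <pre> blocks yields one match spanning all of them. A minimal sketch of one alternative, not part of the answer above: scan the opening and closing tags with re.finditer and keep a nesting counter. The function name `top_level_blocks` and the `sample` string are made up for illustration, tags are assumed to carry no attributes, and textwrap.dedent is used only to strip the common indentation so the text resembles the required result.

```python
import re
import textwrap


def top_level_blocks(html: str, tag: str = "pre") -> list[str]:
    """Return the inner text of every outermost <tag>...</tag> block."""
    # Matches either the opening or the closing tag (no attributes assumed).
    token = re.compile(rf"<(/?){re.escape(tag)}>")
    blocks: list[str] = []
    depth = 0
    body_start = 0
    for m in token.finditer(html):
        if not m.group(1):  # opening tag
            if depth == 0:
                body_start = m.end()
            depth += 1
        elif depth > 0:  # closing tag that has a matching opener
            depth -= 1
            if depth == 0:
                body = html[body_start:m.start()]
                blocks.append(textwrap.dedent(body).strip())
    return blocks


# Hypothetical sample: two top-level blocks, the first with a nested <pre>.
sample = """<div>
    <pre>
        first block
        <pre>
        nested
        </pre>
    </pre>
    <pre>
        second block
    </pre>
</div>"""

for block in top_level_blocks(sample):
    print(repr(block))
```

Unbalanced closing tags are skipped and an unclosed opening tag produces no block; real-world HTML with attributes or comments would need a proper parser instead.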
