How to use a regular expression to match text and skip HTML tags?

Question

I have a number of records in a QuickBase table that contain a rich text field. In other words, each contains some paragraphs of text intermingled with HTML tags such as <p>, <strong>, etc.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, from the input below, I would expect to extract only the text "just a small example link to a webpage":

  <p>just a small <a href="#">
  example</a> link</p><p>to a webpage</p>

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored, as QB's backend is written in Python) that correctly does the exact opposite of what I need, i.e. it matches only the HTML tags:

/(<[^>]*>)/
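For reference, a minimal Python sketch of what this pattern matches on the sample input. It assumes the surrounding slashes are only delimiters and not part of the pattern, and that the sample contains a single line break inside the link; the variable names are illustrative only.

```python
import re

# Sample input from the question (line break inside the <a> element kept).
html = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# The question's pattern, without the surrounding / delimiters.
tag_pattern = re.compile(r"(<[^>]*>)")

# With one capturing group, findall returns the captured text of each match.
tags = tag_pattern.findall(html)
print(tags)
```

Every match is a tag and none of the visible text is returned, which is the behaviour described above.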

In a sense, I need the negative image of this expression but have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

Answer 1

Assuming < and > do not appear elsewhere in the text, or appear there only entity-encoded, here is an idea using a lookbehind.

(?:(?<=>)|^)[^<]+

See this demo at regex101.

(?:(?<=>)|^) is an alternation between either the ^ start of the string or a lookbehind for any >. From there, [^<]+ matches one or more characters that are not < (a negated character class).
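The explanation above can be checked with a short Python sketch. It uses the standard re module rather than the Pipelines Text channel, whose exact output handling is not shown here, so the joining and whitespace cleanup at the end is an added assumption and not part of the answer's pattern.

```python
import re

# Sample input from the question (line break inside the <a> element kept).
html = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# The answer's pattern: start at the beginning of the string or right
# after a ">", then take everything up to the next "<".
text_pattern = re.compile(r"(?:(?<=>)|^)[^<]+")

# No capturing groups, so findall returns the full text of each match.
pieces = text_pattern.findall(html)
print(pieces)

# Assumption: the pieces are joined with spaces and runs of whitespace
# are collapsed. This step is not performed by the regex itself.
plain = " ".join(" ".join(pieces).split())
print(plain)
```

The pattern returns the text in separate pieces, and adjacent paragraphs such as "link" and "to a webpage" have nothing between them, so some joining step is needed to get the single line from the question. Joining with a space is a simplification: it would also insert a space where a tag splits a word in the middle.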

Answer 1
1 Accepted 2023-01-22 19:48:11
