
How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, from the input below, I would expect to extract just the text "just a small example link to a webpage":

  <p>just a small <a href="#">
  example</a> link</p><p>to a webpage</p> 

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored, as QB's backend is written in Python) that correctly does the exact opposite of what I need, i.e. it matches only the HTML tags:

/(<[^>]*>)/

In a sense, I need the negative image of this expression, but I have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

Assuming there are no < or > characters outside the tags (or that any such characters are entity-encoded), here is an idea using a lookbehind.

(?:(?<=>)|^)[^<]+

See this demo at regex101.

(?:(?<=>)|^) is an alternation between either ^ (start of the string) or a lookbehind for any >. From there, [^<]+ matches one or more characters that are not < (a negated character class).
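For a quick check outside the Pipelines tool, the pattern can be tried with Python's standard re module. This is only a sketch under the same assumption as above (no stray < or > in the text); the names TEXT_RUNS, html, runs and text are made up for the illustration, and it does not model how Quickbase Pipelines itself assembles the matches.

```python
import re

# Pattern from the answer: a run of non-'<' characters that starts
# either at the beginning of the string or right after a '>'.
TEXT_RUNS = re.compile(r"(?:(?<=>)|^)[^<]+")

# Sample input from the question (the line break inside the link is kept).
html = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# Each match is one piece of text found between two tags.
runs = TEXT_RUNS.findall(html)

# Join the pieces with a space so that adjacent paragraphs do not run
# together ("link" + "to"), then collapse repeated whitespace and newlines.
text = " ".join(" ".join(runs).split())

print(text)  # → just a small example link to a webpage
```

Note that the matches come back as separate pieces ("just a small ", "\nexample", " link", "to a webpage"), so whatever consumes them still has to join them and tidy up the whitespace.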
