Regexp to select first n words and HTML tags around them

I know it's possible to select only the words or only the HTML tags in a given string. But is it possible to select both?

In this example, let's say we want to select the first 5 words and the HTML tags around them:

Input:

<p><strong>This is</strong> <span style="font-size: 1em;">test</span> <strong><em>five</em></strong> words.</p> 
test <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>

Expected output:

<p><strong>This is</strong> <span style="font-size: 1em;">test</span> <strong><em>five</em></strong> words.</p>

It's straightforward to write a regexp that matches all words or all HTML tags, but I'm not sure how to achieve the above result using only a regexp.

I know it's not regexp, but it is pure JavaScript and generally the preferred method for selecting nodes in a document: XPath.

With this XPath expression you select the elements under <body> whose text contains "This is test", with the outermost (largest) one first:

document.evaluate("/html/body//*[contains(.,'This is test')]", document);

In your example, the first <p> will be selected, including its child tags.

The call above returns an XPathResult, and you'll have to iterate over it to do whatever you want with it. You can then iterate over each matched node to get all its child nodes back, or just its text. Iterating over the result set and extracting the data should be recursive, but I created only a simple example to convey the idea.
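To make the iteration step concrete, here is a minimal sketch of how the result could be walked. The helper names `selectNodes` and `collectText` are made up for illustration and are not part of any API. The sketch assumes a snapshot-type result is acceptable, and it writes the constant `XPathResult.ORDERED_NODE_SNAPSHOT_TYPE` as its numeric value 7 so the helpers do not depend on the browser global.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: run an XPath query against `doc` and return
// the matching nodes as a plain array, in document order.
function selectNodes(doc, xpath) {
  const result = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Hypothetical helper: recursively concatenate the text of a node
// and all of its descendants.
function collectText(node) {
  if (node.nodeType === 3) return node.nodeValue; // 3 = Node.TEXT_NODE
  let text = "";
  for (const child of Array.from(node.childNodes || [])) {
    text += collectText(child);
  }
  return text;
}

// Usage in a browser; skipped where no `document` exists.
if (typeof document !== "undefined") {
  const [first] = selectNodes(document, "/html/body//*[contains(.,'This is test')]");
  if (first) {
    console.log(first.outerHTML);    // the element with its child tags
    console.log(collectText(first)); // just the text
  }
}
```

A snapshot result is used here instead of an iterator result because a snapshot stays valid if the document is modified while you loop. For the text alone, `textContent` on the matched element would normally do. The recursive version only shows the shape of the traversal.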

An example jsFiddle
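If the input is only available as a string and no DOM is at hand, a string-only alternative is to use a small regexp as a tokenizer and count the words in a loop, rather than trying to do everything in one pattern. The sketch below is not the XPath approach above, and `firstWordsWithTags` is a made-up name. It assumes that tags contain no `>` inside attribute values and that a "word" is any run of non-whitespace text between tags.

```javascript
// Hypothetical helper: keep the first `n` words of an HTML string
// together with the tags that surround them.
function firstWordsWithTags(html, n) {
  // A token is a tag, a run of non-space text, or a run of whitespace.
  const tokenPattern = /<[^>]+>|[^<\s]+|\s+/g;
  const out = [];
  let words = 0;
  for (const [token] of html.matchAll(tokenPattern)) {
    const isTag = token.startsWith("<");
    const isSpace = !isTag && token.trim() === "";
    if (words < n) {
      // Still collecting: keep everything, count only real words.
      out.push(token);
      if (!isTag && !isSpace) words += 1;
    } else if (isTag && token.startsWith("</")) {
      // Word quota reached: keep closing tags that follow directly.
      out.push(token);
    } else {
      break;
    }
  }
  return out.join("");
}

const input =
  '<p><strong>This is</strong> <span style="font-size: 1em;">test</span> ' +
  '<strong><em>five</em></strong> words.</p> \n' +
  'test <p>Lorem Ipsum is simply dummy text.</p>';
console.log(firstWordsWithTags(input, 5));
```

Note that this only appends the closing tags that immediately follow the last kept word. If the cut falls in the middle of an element (for example `n = 2` here), outer tags such as `<p>` stay unclosed, so a real implementation would need to track open tags or use a proper HTML parser.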
