Regex: How to select everything but a specified regex pattern
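I am trying to create a regular expression that selects everything in a text except a specified pattern.
As you can see here: https://regex101.com/r/kFJFVi/2
The text pattern I want to ignore is this: <([^>]+?)([^>]*?)>(.*?)<\/\1>
I have tried a few strategies, but so far without success.
Based on a related question, for example: ^.*(<([^>]+?)([^>]*?)>(.*?)<\/\1>)?.*$
But this pattern selects all of the text and does not ignore the tags.
I also checked this question: but in this case
Sample text for this regex: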
This is the second paragraph. It contains an ordered list:
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
test to odfjdf iofsdfsoh
The expected result is:
First match:
This is a text after the list in the second paragraph.
This is another part of a paragraph
Second match:
This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
This is another part of a paragraph
Third match:
test to odfjdf iofsdfsoh
Fourth match:
test to odfjdf iofsdfsoh
Basically, all of the text that is not inside HTML tags.
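To see why the attempted pattern selects everything, it may help to run it on a small sample. The following is only an illustrative JavaScript sketch: the sample string is made up, and the pattern is copied from the attempt as written. The first greedy .* consumes the whole line, and because the tag group is optional it is allowed to match nothing.

```javascript
// Hypothetical single-line sample (not the text from the question).
const sample = "before <ol><li>Item 1</li></ol> after";

// The attempted pattern, copied as written.
const attempt = /^.*(<([^>]+?)([^>]*?)>(.*?)<\/\1>)?.*$/;

const m = sample.match(attempt);

// The whole line is matched, tags included.
console.log(m[0] === sample); // true

// The optional tag group did not take part in the match.
console.log(m[1]); // undefined
```

Making the leading .* lazy would not help much either, since the optional group can still be skipped. The tags have to be matched and discarded explicitly.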
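If a RegExp is not an absolute requirement:
Parsing XML/HTML with DOMParser is usually easier than using a RegExp. The code below creates a new document, removes the <ol> elements, and cleans up the result.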
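// Parse the markup of the #content element into a separate document.
const p = new DOMParser();
const doc = p.parseFromString(document.getElementById("content").innerHTML, "text/html");
// Remove every <ol> that is a direct child of the parsed body.
doc.querySelectorAll("body ol").forEach(n => doc.querySelector("body").removeChild(n));
// Split the remaining text into lines, trim them, and drop the empty ones.
let result = doc.querySelector("body").textContent.split("\n");
result = result.map(str => str.trim()).filter(str => str !== "");
console.log(result);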
<div id="content">
This is the second paragraph. It contains an ordered list:
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
test to odfjdf iofsdfsoh
</div>
Thanks to Jay, I found a way to reach a solution. Their JavaScript post led me to a way of inverting the regex search.
My solution is in C#:
using System;
using System.Linq;
using System.Text.RegularExpressions;

var content = @"
This is the second paragraph. It contains an ordered list:
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
This is another part of a paragraph
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ol>
test to odfjdf iofsdfsoh
";
// First, wrap the pattern to ignore in a named group. RegexOptions.Singleline lets '.' match line breaks.
Regex textOutsideTag = new(@"(?<innerTags><([^>]+?)([^>]*?)>(.*?)<\/\1>)", RegexOptions.Singleline);
// Replace every match with the marker {break}, then split on that marker to get the remaining text as an array.
// Regex.Replace covers all matches, so this also works when the tagged blocks differ from one another.
var textGroups = textOutsideTag
.Replace(content, "{break}")
.Split("{break}");
foreach(var texts in textGroups){
Console.WriteLine(texts);
}
/// output:
This is the second paragraph. It contains an ordered list:
This is a text after the list in the second paragraph.
This is another part of a paragraph
This is a text after the other list in the second paragraph.
This is a text after the list in the second paragraph.
This is another part of a paragraph
test to odfjdf iofsdfsoh
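The same replace-then-split idea can be sketched in JavaScript. This is a minimal sketch, not a port of the exact code above: the function name textOutsideTags, the sentinel character, and the shortened sample text are assumptions made for illustration, and [\s\S] stands in for RegexOptions.Singleline.

```javascript
// Tag pattern in the spirit of the question: an opening tag, anything, then the matching closing tag.
const tagPattern = /<([^>\s]+)[^>]*>[\s\S]*?<\/\1>/g;

// Hypothetical helper: returns the trimmed, non-empty text pieces found outside the tags.
function textOutsideTags(content) {
  const sentinel = "\u0000"; // assumed never to occur in the input
  return content
    .replace(tagPattern, sentinel) // drop each tagged block, leaving a marker behind
    .split(sentinel)               // cut the text at the markers
    .map(part => part.trim())
    .filter(part => part !== "");
}

// Shortened, made-up sample text.
const content = `Intro text:
<ol>
<li>Item 1</li>
</ol>
Middle text.
<ol>
<li>Item 2</li>
</ol>
Closing text.`;

console.log(textOutsideTags(content));
// [ 'Intro text:', 'Middle text.', 'Closing text.' ]
```

Like the original pattern, this stops at the first matching closing tag, so nested elements with the same tag name are not handled.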
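To build a regex that selects everything in a text except a specified pattern, a negative lookahead is a natural first thing to try. A negative lookahead lets you specify a pattern that must not match at the current position; the regex only proceeds when that pattern does not start there.
For example, applied to the HTML tag pattern from the question, it looks like this:
(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*
This regex matches any character (.) zero or more times (*), starting at a position that is not immediately followed by the HTML tag pattern given in the negative lookahead ((?!...)). Note that the lookahead is only checked where the match starts: .* then consumes the rest of the line, including any tags that begin later, so on its own this does not exclude the tags. For that, the tags have to be matched and discarded, for instance by replacing or splitting on the tag pattern.
Example:
let input = "..."; // the input text
let regex = /(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*/g; // the regular expression
let matches = input.match(regex); // all matches (these can still contain tags, plus empty strings)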
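A quick check of how the lookahead pattern behaves on a small input, followed by one possible alternative. The alternative is a sketch of the common "match the unwanted part first, capture the rest" alternation; the sample input, the function name matchOutsideTags, and the slightly simplified tag pattern are assumptions for illustration.

```javascript
// Hypothetical sample input.
const input = "one <b>x</b> two <i>y</i> three";

// Lookahead-only pattern: the lookahead is tested once at the start, then .* runs to the end of the line.
const lookaheadOnly = /(?!<([^>]+?)([^>]*?)>(.*?)<\/\1>).*/g;
console.log(input.match(lookaheadOnly));
// [ 'one <b>x</b> two <i>y</i> three', '' ]

// Alternation: the first branch consumes a whole tagged block, the second captures text outside tags.
const tagOrText = /<([^>\s]+)[^>]*>[\s\S]*?<\/\1>|([^<]+)/g;

function matchOutsideTags(text) {
  const parts = [];
  for (const m of text.matchAll(tagOrText)) {
    const outside = m[2]; // undefined when the tag branch matched
    if (outside !== undefined && outside.trim() !== "") {
      parts.push(outside.trim());
    }
  }
  return parts;
}

console.log(matchOutsideTags(input));
// [ 'one', 'two', 'three' ]
```

Because the tag branch is tried first at every position, tagged blocks are consumed before the text branch can see them. As with the other regex approaches here, nested elements with the same tag name are not handled.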