Regex Notepad++ HTML search and replace

Question

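I'm trying to batch process (search and replace) a few hundred thousand HTML pages using regex in Notepad++. All the HTML pages have exactly the same layout, and I'm basically trying to copy one element (the title) into the page's title tag, which is currently not empty.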

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

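But I can't seem to put them together into one expression (ignoring the junk in between) so that I can do the search and replace in Notepad++.

The uniqueID div is the same on every page (whitespace, line breaks), and it contains nothing other than the span. The title tag obviously occurs only once per page. I'm just getting started with regex, and the possibilities are endless. I know it's not the ideal tool for parsing HTML, but in this case it should do. Does anyone know how to patch these two expressions together so that the content in between is ignored?

Thanks a lot!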

Answer 1

You can copy the title from the span into the title tag by using the following in Notepad++'s Replace dialog...

Find what: <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
Replace with: <title>$3</title>$2

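...provided you select Regular expression and check ". matches newlin" in the dialog (yes, "newlin" rather than "newline", at least in the version of Notepad++ on the machine I'm using). With $2 and $3 you are using backreferences to the values captured by the groups: $3 is the span text, and $2 is everything from just after </title> through the closing </div> of the uniqueID div, which has to be written back unchanged.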
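
For anyone who wants to try the expression outside Notepad++, here is a minimal sketch of the same substitution in Python. Python is my choice for illustration, not part of the answer above; the function name copy_span_into_title and the shortened SAMPLE page are made up for the example. re.DOTALL stands in for the ". matches newline" checkbox, and Python spells the backreferences \g<3> and \g<2> rather than $3 and $2.

```python
import re

# Same expression as the "Find what" field. re.DOTALL lets "." match
# newlines, like the ". matches newline" checkbox in the Replace dialog.
PATTERN = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)


def copy_span_into_title(html: str) -> str:
    """Replace the <title> text with the text of the uniqueID span.

    Hypothetical helper name, used only for this illustration.
    """
    # Group 3 is the span text, group 2 is everything between </title>
    # and the end of the uniqueID div, which is written back unchanged.
    return PATTERN.sub(r"<title>\g<3></title>\g<2>", html, count=1)


# Shortened version of the page from the question (assumed sample data).
SAMPLE = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>
"""

if __name__ == "__main__":
    print(copy_span_into_title(SAMPLE))
```

If a page does not match the pattern, the string comes back unchanged, so it is worth checking a few files by hand before running anything over the whole batch.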

A less constrained pattern for matching the title inside the span runs the risk of picking up spans later in the file, for example:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
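
To make the risk concrete, here is a small Python sketch that runs two variants of the expression over a shortened copy of the page above. The names LOOSE and STRICT are made up for the example, and I am assuming the "less constrained" variant means capturing the span text with (.*). With "." matching newlines, the greedy (.*) runs on to the last </span> followed by </div> in the file.

```python
import re

# Shortened copy of the example page (assumed sample data).
HTML = """<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
</body>
"""

# Loose: the span text is captured with (.*), which can cross tags and
# newlines when "." matches newlines.
LOOSE = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(.*)</span>\s*</div>)""",
    re.DOTALL,
)

# Constrained: the span text may only contain letters, spaces and apostrophes,
# so the capture has to stop at the first "<".
STRICT = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)

loose_title = LOOSE.search(HTML).group(3)
strict_title = STRICT.search(HTML).group(3)

if __name__ == "__main__":
    print("loose: ", repr(loose_title))
    print("strict:", repr(strict_title))
```

The loose capture contains the closing tags of the first span and the whole text of the second one, while the constrained capture is just the intended title.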

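If the title to be copied from the span contains characters other than upper- and lowercase letters, spaces, and apostrophes, you can add them to the character class [A-Za-z '] as needed (for example, [A-Za-z '_] to include underscores, or [A-Za-z0-9 '] to include digits). Just be careful with the HTML markup characters themselves, such as < and >.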
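
Here is a hedged Python sketch of widening the character class. The helper build_pattern and the sample title are invented for the illustration. The last variant, [^<>], is a substitution of my own rather than something from the answer: it accepts anything except the two markup characters, which is one way to heed the warning about < and >.

```python
import re


def build_pattern(title_chars: str) -> re.Pattern:
    """Build the find expression with a caller-supplied character class body.

    Hypothetical helper; title_chars is pasted between [ and ] as-is.
    """
    return re.compile(
        r'<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(['
        + title_chars
        + r"]*)</span>\s*</div>)",
        re.DOTALL,
    )


# Assumed sample: a title containing digits, an underscore and a comma.
HTML = (
    "<title>old</title>\n<body>\n"
    '<div id="uniqueID">\n<span>Top_10 tips, part 2</span>\n</div>\n</body>\n'
)

ORIGINAL = build_pattern(r"A-Za-z '")       # letters, spaces, apostrophes
WIDENED = build_pattern(r"A-Za-z0-9 '_,")   # plus digits, underscore, comma
NO_MARKUP = build_pattern(r"^<>")           # anything except < and >

if __name__ == "__main__":
    for name, pattern in [("original", ORIGINAL), ("widened", WIDENED), ("no markup", NO_MARKUP)]:
        match = pattern.search(HTML)
        print(name, "->", match.group(3) if match else "no match")
```

The original class does not match this title at all, so the page would be skipped silently; both wider classes capture the full span text.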

Solution 1
0 Accepted 2014-04-01 16:45:10
