Regex Notepad++ HTML search and replace

I am trying to batch process (search and replace) a couple hundred thousand HTML pages with regex in Notepad++. All the HTML pages have the exact same layout, and I am basically trying to copy an element (a title) into the page's title tag, which isn't currently empty.

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

But I can't seem to fit them into one expression (ignoring the junk in between) so that I can search and replace in Notepad++.

The uniqueID div is the same on every page (spaces, newlines); there is nothing in it other than the span with its content. The title tag is obviously present only once on every page. I just started with regular expressions and the possibilities are endless. I know regex isn't ideal for parsing HTML, but for this case it should do. Does anyone know how to patch these two expressions together to ignore the in-between content?

Thank you so much!

You can use the following in Notepad++'s Replace dialog to copy the title in the span to the title tag...

  • Find what : <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
  • Replace with : <title>$3</title>$2

...if you select Regular expression and check ". matches newlin" in the dialog (yes, "newlin" rather than "newline" - at least in the version of Notepad++ on the machine I am using). By using $2 and $3 you are leveraging backreferences to the groups' captured values: $3 is the span text and $2 is everything from just after </title> through the closing </div>, which is written back unchanged.
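If you want to sanity-check the expression outside Notepad++ before running it over every file, the same find/replace can be sketched in Python. This is only an illustration under assumptions: it uses Python's `re` module rather than Notepad++'s own regex engine, `re.DOTALL` stands in for the ". matches newline" checkbox, Python writes backreferences as `\3` and `\2` instead of `$3` and `$2`, and the `copy_span_title` helper and the sample page are made up for the example.

```python
import re

# Same pattern as the "Find what" field above.
# re.DOTALL plays the role of Notepad++'s ". matches newline" option.
PATTERN = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)


def copy_span_title(html: str) -> str:
    """Hypothetical helper: put the uniqueID span text into the <title> tag."""
    # \3 is the span text; \2 is everything after </title> up to the closing </div>.
    return PATTERN.sub(r"<title>\3</title>\2", html, count=1)


sample = """<html>
<head>
<title>some title</title>
</head>
<body>
<div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>
"""

result = copy_span_title(sample)
print(result)
```

The span itself is left in place; only the contents of the title tag change.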

A less constrained pattern for matching the spans with the titles (for example, (.*) in place of the character class) runs the risk of grabbing spans later in the files - for example:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>

If the titles to copy from the spans contain characters other than uppercase and lowercase letters, spaces, and apostrophes, then you can add to the character class [A-Za-z '] as needed (e.g. [A-Za-z '_] to include underscores, or [A-Za-z0-9 '] to include digits). Just watch out for HTML markup characters themselves - e.g. < and > .
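To see the over-matching concretely, here is a small Python sketch that compares a loose `(.*)` capture with the constrained character class on a page like the one above. As before, Python's `re` with `re.DOTALL` is an assumption standing in for Notepad++ with ". matches newline" checked, and the names `LOOSE`, `STRICT`, and `page` are made up for the example.

```python
import re

FLAGS = re.DOTALL  # stands in for ". matches newline"

# Loose variant: (.*) for the span text, so it can run past the first </span>.
LOOSE = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(.*)</span>\s*</div>)""",
    FLAGS,
)
# Constrained variant: the character class cannot cross a '<', so it stops
# at the end of the span inside the uniqueID div.
STRICT = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    FLAGS,
)

page = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
"""

loose_title = LOOSE.search(page).group(3)
strict_title = STRICT.search(page).group(3)

print(repr(loose_title))   # spills across both spans, markup included
print(repr(strict_title))  # only the text of the uniqueID span
```

The greedy `(.*)` extends to the last `</span>` that is followed by `</div>`, which is why the loose capture drags in the second div.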
