Regex Notepad++ html search replace

Question

I am trying to batch process (search and replace) a couple hundred thousand HTML pages with regex in Notepad++. All the HTML pages have the exact same layout, and I am basically trying to copy an element (a title) into the page's title tag, which isn't currently empty:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

But I can't seem to be able to fit them into one expression (ignoring the junk in between) to be able to search and replace it in Notepad++.

The uniqueID div is the same in every page (spaces, newlines); there is nothing else in it other than the span with its content. The title tag is obviously present only once in every page. I just started with regular expressions and the possibilities are endless. I know regex is not ideal for parsing HTML, but for this case it should do. Does anyone know how to patch these two expressions together so that they ignore the in-between content?

Thank you so much!

Answer 1

You can use the following in Notepad++'s Replace dialog to copy the title in the span to the title tag...

Find what : <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
Replace with : <title>$3</title>$2

...if you select Regular expression and check . matches newlin in the dialog (yes, "newlin" rather than "newline" - at least in the version of Notepad++ on the machine I am using). By using $2 and $3 you are leveraging backreferences to groups' captured values.
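If you want to sanity-check the expression outside Notepad++ before running it over every file, here is a minimal Python sketch of the same find/replace. It assumes Python's re module behaves like Notepad++'s engine for this particular pattern (re.DOTALL stands in for ". matches newline", and \3 and \2 stand in for $3 and $2). The function name copy_span_title and the sample page are made up for illustration.

```python
import re

# Same pattern as in the Replace dialog; re.DOTALL plays the role of
# the ". matches newline" checkbox.
PATTERN = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)


def copy_span_title(html: str) -> str:
    """Put the span's text (group 3) into the title tag and keep
    everything between the two (group 2) unchanged."""
    return PATTERN.sub(r"<title>\3</title>\2", html, count=1)


# Hypothetical page, shaped like the layout in the question.
sample = """<html>
<head>
<title>some title</title>
<meta name="junk">
</head>
<body>
<span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>
"""

result = copy_span_title(sample)
print(result)
```

The span inside the uniqueID div is left in place; only the contents of the title tag change.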

A less constrained pattern to match the spans with the titles runs the risk of grabbing spans later in the files - for example:

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
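To see the over-matching concretely, here is a rough Python sketch that runs a loose pattern (with (.*) inside the span) and the constrained one against a page like the example above. It assumes Python's re with re.DOTALL is close enough to Notepad++ with ". matches newline" checked for this case; the names LOOSE and STRICT are just for illustration.

```python
import re

# Hypothetical page with a second div/span after the uniqueID div.
sample = """<html>
<head>
<title>some title</title>
</head>
<body>
<span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
"""

# Loose: (.*) inside the span is greedy and can run across tags and newlines.
LOOSE = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(.*)</span>\s*</div>)""",
    re.DOTALL,
)

# Constrained: the character class cannot cross a "<", so it stops at </span>.
STRICT = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)

loose_title = LOOSE.search(sample).group(3)
strict_title = STRICT.search(sample).group(3)

print(repr(loose_title))   # spills past the first </span> into the second div
print(repr(strict_title))  # only the text of the span in the uniqueID div
```

With the loose pattern, the captured "title" runs on until the last </span> that is followed by a </div>, which is why the character class is worth the extra typing.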

If the titles to copy from the spans contain characters other than uppercase and lowercase letters, spaces, and apostrophes, then you can add to the character class [A-Za-z '] as needed (eg [A-Za-z '_] to include underscores). Just watch out for the HTML markup characters themselves - eg < and > .
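As a quick way to check whether a widened character class covers your titles, here is a small Python sketch. The helper build_pattern, its title_chars parameter, and the sample page are all hypothetical; it assumes Python's re with re.DOTALL mirrors Notepad++ with ". matches newline" checked for this pattern.

```python
import re


def build_pattern(title_chars: str = "A-Za-z '") -> re.Pattern:
    """Build the find pattern with a configurable character class for the
    span text. Do not put < or > in title_chars, or the span group could
    run across tags."""
    return re.compile(
        r'<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(['
        + title_chars
        + r"]*)</span>\s*</div>)",
        re.DOTALL,
    )


# Hypothetical page whose title contains a digit and an underscore.
page = (
    "<title>old</title>\n<body>\n"
    '<div id="uniqueID">\n<span>Release_2 Notes</span>\n</div>\n</body>'
)

default_match = build_pattern().search(page)                 # class is too narrow
extended_match = build_pattern("A-Za-z0-9 '_").search(page)  # digits and _ added

print(default_match)
print(extended_match.group(3))
```

A page where the default class fails to match is simply left untouched by the replace, so it is worth testing the class against a few of the more unusual titles first.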


1 answer

solution1
0 ACCEPTED 2014-04-01 16:45:10
