简体   繁体   中英

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags. eg if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag> , I must obtain stack <sometag>underflow</sometag> (ie the string substitution is done, but the tags are still there...

Use a DOM library, not regular expressions, when dealing with manipulating HTML:

  • lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer.
  • ElementTree: a document object, and XML serializer
  • cElementTree: a document object implemented as a C extension.
  • HTMLParser: a parser.
  • Genshi: includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ .

Out of these I would recommend lxml, html5lib, and BeautifulSoup.

美丽的汤HTMLParser是您的答案。

Note that arbitrary replacements can't be done unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.

Use html parser such as provided by lxml or BeautifulSoup . Another option is to use XSLT transformations ( XSLT in Jython ).

I don't think that the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: overflow should replaced with underflow only when preceded by stack in the rendered document, whether or not there are tags between them. Such a library is a necessary part the solution, though.

Assuming that tags never appear in the middle of words, one solution would be to

  1. process the DOM, tokenize all text nodes and insert a unique identifier at the beginning of each token (eg word)
  2. render the document as plain text
  3. search and replace the plain text with regexes which use groups to match, preserve and mark unique identifiers at the beginning of each token
  4. extract all tokens with marked unique identifiers from the plain text
  5. process the DOM by removing unique identifiers and replacing tokens matching marked unique identifiers with corresponding changed tokens
  6. render the processed DOM back to HTML

Example:

In 1. the HTML DOM,

stack <sometag>overflow</sometag>

becomes the DOM

#1;stack <sometag>#2;overflow</sometag>

and in 2. the plain text is produced:

#1;stack #2;overflow

The regex needed in 3. is #(\\d+);stack\\s+#(\\d+);overflow\\b and the replacement #\\1;stack %\\2;underflow . Note that only the second word is marked by changing # to % in the unique identifier, since the first word isn't altered.

In 4., the word underflow with the unique identifier numbered 2 is extracted from the resulting plain text since it was marked by changing the # to a % .

In 5., all #(\\d+); identifiers are removed from text nodes of the DOM while looking up their numbers among extracted words. The number 1 is not found, so #1;stack is replaced with simply stack . The number 2 is found with the changed word underflow , so #2;overflow is replaced by underflow .

Finally in 6. the DOM is rendered back to the HTML document `stack underflow.

Fun stuff to try. It sorta works. My friends like it when I attach this script to a textarea and let them "translate" things. I guess you could use it for anything really. Meh. Check the code over a few times if you're going to use it, it works but I'm new to all this. I think it's been 2 or three weeks since I started studying the php.


<?php

$html = ('<div style="border: groove 2px;"><p>Dear so and so, after reviewing your application I. . .</p><p>More of the same...</p><p>sincerely,</p><p>Important Dude</p></div>');

$oldWords = array('important', 'sincerely');

$newWords = array('arrogant', 'ya sure');

// function for oldWords
function regex_oldWords_word_list(&$item1, $key)
{

    $item1 = "/>([^<>]+)?\b$item1(tionally|istic|tion|ance|ence|less|ally|able|ness|ing|ity|ful|ant|est|ist|ic|al|ed|er|et|ly|y|s|d|'s|'d|'ve|'ll)?\b([^<>]+)?/";

}

// function for newWords
function format_newWords_results(&$item1, $key)
{

    $item1 = ">$1<span style=\"color: red;\"><em> $item1$2</em></span>$3";

}

// apply regex to oldWords
array_walk($oldWords, 'regex_oldWords_word_list');

// apply formatting to newWords
array_walk($newWords, 'format_newWords_results');

//HTML is not always as perfect as we want it
$poo = array('/  /', '/>([a-zA-Z\']+)/', '/’/', '/;([a-zA-Z\']+)/', '/"([a-zA-Z\']+)/', '/([a-zA-Z\']+)</', '/\.\.+/', '/\. \.+/');

$unpoo = array(' ', '> $1', '\'', ';  $1', '"  $1', '$1  <', '. crap taco.', '. crap taco with cheese.');

//and maybe things will go back to normal sort of
$repoo = array('/>  /', '/;  /', '/"  /', '/  </');

$muck = array('> ', ';', '"',' <');

//before
echo ($html);

//I don't know what was happening on the free host but I had to keep stripping slashes
//This is where the work is done anyway.
$html = stripslashes(preg_replace($repoo , $muck , (ucwords(preg_replace($oldWords , $newWords , (preg_replace($poo , $unpoo , (stripslashes(strtolower(stripslashes($html)))))))))));

//after
echo ('<hr/> ' . $html);

//now if only there were a way to keep it out of the area between
//<style>here</style> and <script>here</script> and tell it that english isn't math.

?>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM