简体   繁体   中英

Javascript Regular Expression to Parse HTML and word wrap?

I need to create a bit of Javascript that can search inputted HTML from a text box and ignore all the tags to automatically word wrap at a set number like say 70 and add a <br> tag.

I also need to find all the ascii like &copy; and &#150; and count that as one space not 5 or 4 spaces.

So the code would take:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.

Output would be:

<b>Hello</b> Here is some code that I would like to wrap. Lets pretend <br>
this goes on for over 70 spaces.

Is this possible? How would I begin? Is there already a tool for this?

By the way CSS is out of the question to use.

While the combination of the phrases "regular expression" and "parse HTML" usually causes entire universes to crumble , your use case seems simplistic enough that it could work, but the fact that you want to preserve HTML formatting after wrapping makes it much easier to just work on a space-delimited sequence. Here is a very rough approximation of what you'd like to do:

input = "<b>Hello</b> Here is some code that I would like to wrap. Let's pretend this goes on for over 70 spaces. Better &yen;&euro;&#177;, let's <em>make</em> it go on for more than 70, and pick &uuml;&thorn; a whole <strong>bu&ntilde;&copy;h</strong> of crazy symbols along the way.";
words = input.split(' ');

lengths = [];
for (var i = 0; i < words.length; i++)
  lengths.push(words[i].replace(/<.+>/g, '').replace(/&.+;/g, ' ').length);

line = [], offset = 0, output = [];
for (var i = 0; i < words.length; i ++) {
  if (offset + (lengths[i] + line.length - 1) < 70) {
    line.push(words[i]);
    offset += lengths[i];
  }
  else {
    output.push(line.join(' '));
    offset = 0; line = [], i -= 1;;
  }
  if (i == words.length - 1)
    output.push(line.join(' '));
}

output = output.join('<br />');

which results in

Hello Here is some code that I would like to wrap. Let's pretend this
goes on for over 70 spaces. Better ¥€±, let's make it go on for more
than 70, and pick üþ a whole buñ©h of crazy symbols along the way.

Note that the HTML tags ( b , em , strong ) are preserved, it's just that Markdown doesn't show them.

Basically, the input string is split into words at each space, which is naïve and likely to cause trouble, but it's a start. Then, the length of each word is calculated after anything resembling an HTML tag or entity has been removed. Then it's a simple matter of iterating over each word, keeping a running tally of the column we're on; once we've struck 70, we pop the aggregated words into the output string and reset. Again, it's very rough, but it should suffice for most basic HTML.

This solution "walks" the string token by token counting up to the desired line length. The regex captures one of four different tokens:

  • $1: HTML open/close tag (width = 0)
  • $2: HTML entity. (width = 1)
  • $3: Line terminator. (counter is reset)
  • $4: Any other character. (width = 1)

Note that I've added a line terminator token in case your textbox is already formatted with linefeed (with optional carriage returns). Here is a JavaScript function that walks the string using String.replace() and an anonymous callback counting tokens as it goes:

function breakupHTML(text, len);

// Break up textarea into lines having len chars.
function breakupHTML(text, len) {
    var re = /(<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)|(&(?:\w+|#x[\da-f]+|#\d+);)|(\r?\n)|(.)/ig;
    var count = 0;  // Initialize line char count.
    return text.replace(re,
        function(m0, m1, m2, m3, m4) {
            // Case 1: An HTML tag. Do not add to count.
            if (m1) return m1;
            // Case 2: An HTML entity. Add one to count.
            if (m2) {
                if (++count >= len) {
                    count = 0;
                    m2 += '<br>\n';
                }
                return m2;
            }
            // Case 3: A hard coded line terminator.
            if (m3) {
                count = 0;
                return '<br>\n';
            }
            // Case 4: Any other single character.
            if (m4) {
                if (++count >= len) {
                    count = 0;
                    m4 += '<br>\n';
                }
                return m4;
            } // Never get here.
        });
}

Here's a breakdown of the regex in commented format so you can see what is being captured:

p = re.compile(r"""
    # Match one HTML open/close tag, HTML entity or other char.
      (<(?:[^'"<>]+|'[^']*'|"[^"]*")*>)  # $1: HTML open/close tag
    | (&(?:\w+|\#x[\da-f]+|\#\d+);)      # $2: HTML entity.
    | (\r?\n)                            # $3: Line terminator.
    | (.)                                # $4: Any other character.
    """, re.IGNORECASE | re.VERBOSE)

Not wanting to unleash Cthulhu , I decided (unlike my fellow answers) to instead provide an answer to your problem that does not attempt to parse HTML with regular expressions. Instead, I turned to the awe-inspiring force for good that is jQuery, and used that to parse your HTML on the client side.

A working fiddle: http://jsfiddle.net/CKQ9f/6/

The html:

<div id="wordwrapOriginal">Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g<b class="foo bar">Helloend this goes on for over 70 spaces.etend</b>Here is some code that I would like to wrap. Lets pretend this goes on for over 70 spaces.etend this g</div>
<hr>
<div id="wordwrapResult"></div>

The jQuery:

// lifted from here: https://stackoverflow.com/a/5259788/808921
$.fn.outerHTML = function() {
    $t = $(this);
    if( "outerHTML" in $t[0] )
    { return $t[0].outerHTML; }
    else
    {
        var content = $t.wrap('<div></div>').parent().html();
        $t.unwrap();
        return content;
    }
}

// takes plain strings (no markup) and adds <br> to 
// them when each "line" has exceeded the maxLineLen
function breakLines(text, maxLineLen, startOffset)
{
   var returnVals = {'text' : text, finalOffset : startOffset + text.length};
   if (text.length + startOffset > maxLineLen)
   {
      var wrappedWords = "";
      var wordsArr = text.split(' ');
      var lineLen = startOffset;
      for (var i = 0; i < wordsArr.length; i++)
      {
        if (wordsArr[i].length + lineLen > maxLineLen)
        {
          wrappedWords += '<br>';
          lineLen = 0;
        } 
        wrappedWords += (wordsArr[i] + ' ');
        lineLen += (wordsArr[i].length + 1);
      } 
      returnVals['text'] = wrappedWords.replace(/\s$/, '');
      returnVals['finalOffset'] = lineLen;
   }
   return returnVals;
}

// recursive function which will traverse the "tree" of HTML 
// elements under the baseElem, until it finds plain text; at which 
// point, it will use the above function to add newlines to that text
function wrapHTML(baseElem, maxLineLen, startOffset)
{
    var returnString = "";
    var currentOffset = startOffset;

    $(baseElem).contents().each(function () {
        if (! $(this).contents().length) // plain text
        {
            var tmp = breakLines($(this).text(), maxLineLen, currentOffset);
            returnString += tmp['text'];
            currentOffset = tmp['finalOffset'];

        }
        else // markup
        {
            var markup = $(this).clone();
            var tmp = wrapHTML(this, maxLineLen, currentOffset);
            markup.html(tmp['html']);
            returnString += $(markup).outerHTML();
            currentOffset = tmp['finalOffset'];
        }
    });

    return {'html': returnString, 'finalOffset': currentOffset};
}


$(function () {

   wrappedHTML = wrapHTML("#wordwrapOriginal", 70, 0);

   $("#wordwrapResult").html(wrappedHTML['html']);

});

Note the recursion - can't do that with a regex!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM