
Regex to split HTML by tags whose text contains fewer than n characters

I want to split the following string on the <p> tags whose text is shorter than 4 characters (here, <p>1</p> and <p>2</p>) using a regex.

<span id="_ctl0_contentMain__kDP_dp_Text" class="kDPText">
<p>1</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>2</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
</span>

The following regex matches <p>...</p> with up to three characters between the tags:

<p>.{0,3}<\/p>

Demo:

var input = `<span id="_ctl0_contentMain__kDP_dp_Text" class="kDPText">
<p>1</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>2</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
</span>`;

// Split on every <p>...</p> whose content is at most three characters long
console.log(input.split(/<p>.{0,3}<\/p>/));
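The question title asks about "n characters" in general, so here is a minimal sketch of the same regex with the limit passed in as a parameter. The helper name splitOnShortParagraphs is hypothetical, and the swap from . to [^<] is my own assumption: it stops a match from running across another tag and also counts line breaks, which . does not match by default.

```javascript
// Hypothetical helper: split `html` on <p> elements whose text is shorter than `n` characters.
function splitOnShortParagraphs(html, n) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  // "Shorter than n" means at most n - 1 characters between the tags.
  // [^<] keeps the match inside a single paragraph that contains no nested tags.
  var pattern = new RegExp("<p>[^<]{0," + (n - 1) + "}</p>", "i");
  return html
    .split(pattern)
    .map(function (part) { return part.trim(); })
    .filter(function (part) { return part.length > 0; }); // drop empty pieces
}

var sample = "<p>1</p><p>Lorem ipsum</p><p>dolor</p><p>2</p><p>sit amet</p>";
var parts = splitOnShortParagraphs(sample, 4);
console.log(parts);
```

With n = 4 only <p>1</p> and <p>2</p> act as separators; with a larger n, <p>dolor</p> would become a separator as well.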

If you want to use a regular expression, something like the following code works:

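// Markup of the container that holds the paragraphs
var string_to_split = document.getElementById("_ctl0_contentMain__kDP_dp_Text").innerHTML;
// Matches a <p> element with at most three characters of content, case-insensitively
var your_regExp = new RegExp("<p>.{0,3}<\/p>", "ig");
// Split on the short paragraphs and drop the pieces that are empty or whitespace only
var result = string_to_split.split(your_regExp).filter(function(x) { return x.trim().length; });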
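If the short paragraphs themselves are needed in the output (for example as section labels), one option is to wrap the pattern in a capture group, because split() then includes each matched separator in the returned array. This is a small sketch on a hypothetical, shortened input string rather than the page's real markup:

```javascript
// Hypothetical shortened input; the real markup would come from innerHTML as above.
var html = "<p>1</p><p>Lorem ipsum</p><p>2</p><p>dolor sit amet</p>";

// The parentheses around the whole pattern make split() keep the separators.
var pieces = html
  .split(/(<p>.{0,3}<\/p>)/i)
  .filter(function (x) { return x.trim().length; }); // drop empty pieces

console.log(pieces);
```

Each short paragraph now sits in the array directly before the content that follows it.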

If you would rather not use a regex, you can work on the DOM with a script like this one (still vanilla JavaScript, but in older browsers such as IE8 you would probably need a polyfill for querySelectorAll):

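var allParagraph = document.querySelectorAll("#_ctl0_contentMain__kDP_dp_Text > p");
var split_para = Array.prototype.reduce.call(
    allParagraph,
    function(acc, x) {
      if (x.innerHTML.length < 4) {
        // A short paragraph is a separator: start a new group at the front
        acc.unshift([]);
      } else {
        // Guard against content that appears before the first separator
        if (acc.length === 0) {
          acc.unshift([]);
        }
        acc[0].push(x);
      }
      return acc;
    },
    []
).reverse(); // groups were added at the front, so restore document order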

Sure, the first solution is simpler, but its result variable holds an array of HTML strings, whereas split_para holds the original paragraph elements, grouped into arrays according to your splitting rule.
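To see the shape of the grouped result without a browser, the same reduce logic can be run on plain strings. This is only a sketch: the paragraphTexts array is a hypothetical stand-in for the paragraph elements, and it appends groups with push instead of using unshift followed by reverse, which gives the same order.

```javascript
// Hypothetical stand-in for the DOM nodes: the text of each paragraph, in document order.
var paragraphTexts = ["1", "Lorem ipsum", "dolor sit", "2", "amet", "consectetur"];

var groups = paragraphTexts.reduce(function (acc, text) {
  if (text.trim().length < 4) {
    // A short paragraph is a separator and opens a new group.
    acc.push([]);
  } else {
    // Content before the first separator still needs a group to land in.
    if (acc.length === 0) {
      acc.push([]);
    }
    acc[acc.length - 1].push(text);
  }
  return acc;
}, []);

console.log(groups);
```

Note that "amet" is exactly four characters long, so it counts as content, not as a separator.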


 