
How can I use a Regular Expression to replace everything except specific words in a string with Javascript

Imagine you have a string like this: "This is a sentence with words."

I have an array of words like $wordList = ["sentence", "words"];

I want to highlight words that aren't on the list. Which means I need to find and replace everything else and I can't seem to crack how to do that (if it's possible) with RegEx.

If I want to match the words I can do something like:

text = text.replace(/(sentence|words)\b/g, '<mark>$&</mark>');

This wraps the matching words in <mark> tags (which, assuming I have some CSS for <mark>, highlights them) and works perfectly. But I need the opposite! I need it to basically select the entire string and then exclude the words listed. I've tried /^((?!sentence|words)*)*$/gm, but it seems to run forever, I think because the pattern is too open-ended.

Taking that original sentence, what I would hope to end up with is "<mark> This is a </mark> sentence <mark> with </mark> words."

Basically wrapping (via replace) everything except the words listed.

The closest I can seem to get is something like /^(?!sentence|words).*\b/igm, which only works when a line starts with one of the words (it then ignores that entire line).

So to summarize: 1) Take a string 2) take a list of words 3) replace everything in the string except the list of words.

Possible? (jQuery is loaded for something else already, so raw JS or jQuery are both acceptable).

Create the regex from the word list.
Then do a string replace with the regex.
(It's a tricky regex)

var wordList = ["sentence", "words"];

// join the array into a string using '|'.
var str = wordList.join('|');
// finalize the string with a negative assertion
str = '\\W*(?:\\b(?!(?:' + str + ')\\b)\\w+\\W*|\\W+)+';

// create a regex from the string
var Rx = new RegExp( str, 'g' );
console.log( Rx );

var text = "%%%555This is a sentence with words, but not sentences ?!??!!...";
text = text.replace( Rx, '<mark>$&</mark>');
console.log( text );

Output

/\W*(?:\b(?!(?:sentence|words)\b)\w+\W*|\W+)+/g
<mark>%%%555This is a </mark>sentence<mark> with </mark>words<mark>, but not sentences ?!??!!...</mark>

Addendum

The regex above assumes the word list contains only word characters.
If that's not the case, you must match the words to advance the match position
past them. This is easily accomplished with a simplified regex and a callback function.

var wordList = ["sentence", "words", "won't"];

// join the array into a string using '|'.
var str = wordList.join('|');
str = '([\\S\\s]*?)(\\b(?:' + str + ')\\b|$)';

// create a regex from the string
var Rx = new RegExp( str, 'g' );
console.log( Rx );

var text = "%%%555This is a sentence with words, but won't be sentences ?!??!!...";

// Use a callback to insert the 'mark'
text = text.replace( Rx, function(match, p1, p2) {
    var retStr = '';
    // only wrap non-empty text that precedes a listed word (or the end of the string)
    if ( p1.length > 0 )
        retStr = '<mark>' + p1 + '</mark>';
    return retStr + p2;
} );
console.log( text );

Output

/([\S\s]*?)(\b(?:sentence|words|won't)\b|$)/g
<mark>%%%555This is a </mark>sentence<mark> with </mark>words<mark>, but </mark>won't<mark> be sentences ?!??!!...</mark>
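Joining the list with '|' also assumes no word contains a regex metacharacter. As a minimal sketch (not part of the answer above), the callback approach can be wrapped in a function that escapes each word first. The names `escapeRegExp` and `markExcept` are made up for illustration, and the word list is assumed to be non-empty with words that start and end with a word character, so that `\b` still applies.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical wrapper around the callback approach shown above.
// Assumes wordList is non-empty and each word starts and ends with a word character.
function markExcept(text, wordList) {
  const alternation = wordList.map(escapeRegExp).join('|');
  const rx = new RegExp('([\\S\\s]*?)(\\b(?:' + alternation + ')\\b|$)', 'g');
  return text.replace(rx, (match, before, word) =>
    // Wrap only non-empty text that precedes a listed word or the end of the string.
    (before.length > 0 ? '<mark>' + before + '</mark>' : '') + word);
}

console.log(markExcept('I use node.js, not nodexjs.', ['node.js']));
// <mark>I use </mark>node.js<mark>, not nodexjs.</mark>
```

Without the escaping step, the "." in "node.js" would match any character, so "nodexjs" would be left unmarked as well.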

You could still perform the replacement on the positive matches, but swap the closing and opening tags, then add an opening tag at the start of the string and a closing tag at the end. I use your regular expression here; it could be anything you want, so I'll assume it correctly matches what needs to be matched:

var text = "This is a sentence with words.";
text = "<mark>" + text.replace(/\b(sentence|words)\b/g, '</mark>$&<mark>') + "</mark>";
// If empty tags bother you, you can add:
text = text.replace(/<mark><\/mark>/g, "");
console.log(text);

Time Complexity

In comments below someone makes a point that the second replacement (which is optional) is a waste of time. But it has linear time complexity as is illustrated in the following snippet which charts the duration for increasing string sizes.

The X-axis represents the number of characters in the input string, and the Y-axis represents the number of milliseconds it takes to execute the replacement with /<mark><\/mark>/g on such an input string:

// Reserve memory for the longest string
const s = '<mark></mark>' + '<mark>x</mark>'.repeat(2000),
    regex = /<mark><\/mark>/g,
    millisecs = {};
// Collect timings for several string sizes:
for (let size = 100; size < 25000; size+=100) {
    millisecs[size] = test(15, 8, _ => s.substr(0, size).replace(regex, ''));
}
// Show results in a chart:
chartFunction(canvas, millisecs, "len", "ms");

// Utilities
function test(countPerRun, runs, f) {
    let fastest = Infinity;
    for (let run = 0; run < runs; run++) {
        const started = performance.now();
        for (let i = 0; i < countPerRun; i++) f();
        // Keep the duration of the fastest run:
        fastest = Math.min(fastest, (performance.now() - started) / countPerRun);
    }
    return fastest;
}

function chartFunction(canvas, y, labelX, labelY) {
    const ctx = canvas.getContext('2d'),
        axisPix = [40, 20],
        // add 30% to value at the 90th percentile
        largeY = Object.values(y).sort( (a, b) => b - a )[
                    Math.floor(Object.keys(y).length / 10)
                ] * 1.3,
        max = [+Object.keys(y).pop(), largeY],
        coeff = [(canvas.width-axisPix[0]) / max[0], (canvas.height-axisPix[1]) / max[1]],
        textAlignPix = [-8, -13];
    ctx.translate(axisPix[0], canvas.height-axisPix[1]);
    text(labelY + "/" + labelX, [-5, -13], [1, 1], false, 2);
    // Draw axis lines
    for (let dim = 0; dim < 2; dim++) {
        const c = coeff[dim], world = [c, 1];
        let interval = 10**Math.floor(Math.log10(60 / c));
        while (interval * c < 30) interval *= 2;
        if (interval * c > 60) interval /= 2;
        let decimals = ((interval+'').split('.')[1] || '').length;
        line([[0, 0], [max[dim], 0]], world, dim);
        for (let x = 0; x <= max[dim]; x += interval) {
            line([[x, 0], [x, -5]], world, dim);
            text(x.toFixed(decimals), [x, textAlignPix[1-dim]], world, dim, dim+1);
        }
    }
    // Draw function
    line(Object.entries(y), coeff);

    function translate(coordinates, world, swap) {
        return coordinates.map( p => {
            p = [p[0] * world[0], p[1] * world[1]];
            return swap ? p.reverse() : p;
        });
    }

    function line(coordinates, world, swap) {
        coordinates = translate(coordinates, world, swap);
        ctx.beginPath();
        ctx.moveTo(coordinates[0][0], -coordinates[0][1]);
        for (const [x, y] of coordinates.slice(1)) ctx.lineTo(x, -y);
        ctx.stroke();
    }

    function text(s, p, world, swap, align) { // align: 0=left,1=center,2=right
        const [[x, y]] = translate([p], world, swap);
        ctx.font = '9px courier';
        ctx.fillText(s, x - 2.5*align*s.length, 2.5-y);
    }
}
 <canvas id="canvas" width="600" height="200"></canvas> 

For each string size (which is incremented with steps of 100 characters), the time to run the regex 15 times is measured. This measurement is repeated 8 times and the duration of the fastest run is reported in the graph. On my PC the regex runs in 25µs on a string with 25 000 characters (consisting of <mark> tags). So not something to worry about ;-)

You may see some spikes in the chart (due to browser and OS interference), but the overall tendency is linear. Given that the main regex has linear time complexity, the overall time complexity is not negatively affected by it.
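For readers who want to repeat the measurement without the canvas chart, here is a chart-free sketch of the same procedure: 15 calls per run, 8 runs, fastest average kept. The function name `fastestAverage` and the three sample sizes are arbitrary choices for illustration, and it assumes `performance.now()` is available as a global (browsers and recent Node.js versions). Absolute numbers will differ from machine to machine.

```javascript
// Same kind of input as in the snippet above: one empty tag pair, then many non-empty ones.
const sample = '<mark></mark>' + '<mark>x</mark>'.repeat(2000);
const emptyTags = /<mark><\/mark>/g;

// Call f() countPerRun times per run, repeat for `runs` runs,
// and return the fastest average duration of a single call, in milliseconds.
function fastestAverage(countPerRun, runs, f) {
  let fastest = Infinity;
  for (let run = 0; run < runs; run++) {
    const started = performance.now();
    for (let i = 0; i < countPerRun; i++) f();
    fastest = Math.min(fastest, (performance.now() - started) / countPerRun);
  }
  return fastest;
}

// Arbitrary sample sizes; roughly doubling makes a linear trend easy to spot.
const timings = {};
for (const size of [5000, 10000, 20000]) {
  const input = sample.slice(0, size);
  timings[size] = fastestAverage(15, 8, () => input.replace(emptyTags, ''));
}
console.log(timings);
```

If the trend is linear, each doubling of the input size should roughly double the reported time, allowing for noise.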

However that optional part can be performed without regular expression as follows:

if (text.substr(6, 7) === '</mark>') text = text.substr(13);
if (text.substr(-13, 6) === '<mark>') text = text.substr(0, text.length-13);

Due to how JavaScript engines deal with strings (immutable), this longer code runs in constant time.

Of course, it does not change the overall time complexity, which remains linear.

I'm not sure if this will work for every case, but for the given string it does.

let s1 = "This is a sentence with words.";
let wordList = ["sentence", "words"];
let reg = new RegExp("([\\s\\S]*?)(" + wordList.join("|") + ")", "g");
console.log(s1.replace(reg, "<mark>$1</mark>$2"))

Do it the opposite way: Mark everything and unmark the matched words you have.

text = `<mark>${text.replace(/\b(sentence|words)\b/g, '</mark>$&<mark>')}</mark>`;
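To use this with the word list from the question rather than a hard-coded pattern, the idea might be wrapped as follows. This is only a sketch: the function name `markExceptInverse` is made up, and it assumes the list is non-empty, the words contain only word characters (so no escaping is done), and the input text does not already contain a literal `<mark></mark>`.

```javascript
// Hypothetical wrapper: mark everything, then unmark the listed words.
// Assumes wordList is non-empty and contains only word characters.
function markExceptInverse(text, wordList) {
  const rx = new RegExp('\\b(?:' + wordList.join('|') + ')\\b', 'g');
  // Close the mark before each listed word and reopen it right after.
  const marked = '<mark>' + text.replace(rx, '</mark>$&<mark>') + '</mark>';
  // Drop empty tag pairs left behind when listed words touch each other or the string edges.
  return marked.replace(/<mark><\/mark>/g, '');
}

console.log(markExceptInverse('This is a sentence with words.', ['sentence', 'words']));
// <mark>This is a </mark>sentence<mark> with </mark>words<mark>.</mark>
```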

Negated regex is possible but inefficient for this. In fact regex is not the right tool. The viable method is to go through the strings and manually construct the end string:

//var text = "This is a sentence with words.";
//var wordlist = ["sentence", "words"];
var result = "";
var marked = false;
var nextIndex = 0;

while (nextIndex != -1) {
    var endIndex = text.indexOf(" ", nextIndex + 1);
    var substring = text.slice(nextIndex, endIndex == -1 ? text.length : endIndex);
    var contains = wordlist.some(word => substring.includes(word));
    if (!contains && !marked) {
        result += "<mark>";
        marked = true;
    }
    if (contains && marked) {
        result += "</mark>";
        marked = false;
    }
    result += substring;
    nextIndex = endIndex;
}

if (marked) {
    result += "</mark>";
}
text = result;
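Note that the loop above treats any space-separated chunk that contains a listed word as a match, so "sentences" is left unmarked too. If whole-word matching is wanted while still assembling the result piece by piece, one possible middle ground (not this answer's method, and it does use a small regex) is to split on the listed words with a capturing group, which keeps the matched words in the resulting array. The name `markExceptSplit` is made up, and the list is assumed to be non-empty and to contain only word characters.

```javascript
// Hypothetical alternative: split on the listed words and wrap the pieces in between.
// Assumes wordList is non-empty and contains only word characters.
function markExceptSplit(text, wordList) {
  // With one capturing group, split() returns text pieces at even indexes
  // and the matched words at odd indexes.
  const rx = new RegExp('\\b(' + wordList.join('|') + ')\\b');
  return text
    .split(rx)
    .map((part, i) => (i % 2 === 1 || part === '') ? part : '<mark>' + part + '</mark>')
    .join('');
}

console.log(markExceptSplit('This is a sentence with words.', ['sentence', 'words']));
// <mark>This is a </mark>sentence<mark> with </mark>words<mark>.</mark>
```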
