
Remove and replace characters by regex

I'm trying to write a regex that does the following:

  1. _ -> replace it with a space
  2. + -> remove it if it is not followed by another + (i.e. c++ -> c++ , c+ -> c )
  3. ' -> remove it if it is at the start or end of a word (i.e. Alin's -> Alin's , 'Alin's -> Alin's )
  4. & , - , . , ! -> don't remove
  5. Any other special character -> remove

I want to do it in a single pass over the string.

For example:

Input: "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'..."
Output: "abc's test s! & c++ c Dirty's. and beautiful..."

Explanation:

char `'` in `abc's,` stays because `3`
char `,` in `abc's,` was removed because `5` 
char `_` in `test_s!` was replaced by space because `1`
char `!` in `test_s!` is not removed because `4`
char `&` is not removed because `4`
char `+` in `c++` is not removed because `2`
char `+` in `c+` was removed because `2`
word: `'Dirty's'.` was replaced by `Dirty's.` because `3` and `4`
char `'` in `beautiful'...` was removed because `3`
char `.` is not removed because of `4`

This is my JavaScript code:

var str = "abc's test_s c++ c+ 'Dirty's'. and beautiful";
console.log(str);
str = str.replace(/[_]/g, " ");
str = str.replace(/[^a-zA-Z0-9 &-.!]/g, "");
console.log(str);

This is my jsfiddle: http://jsfiddle.net/alonshmiel/LKjYd/4/

I don't like my code because I'm sure it's possible to do this in a single pass over the string.

Any help appreciated!

function sanitize(str){
    return str.replace(/(_)|(\'\W|\'$)|(^\'|\W\')|(\+\+)|([a-zA-Z0-9\ \&\-\.\!\'])|(.)/g, function(car,p1,p2,p3,p4,p5,p6){
        if(p1) return " ";
        if(p2) return sanitize(p2.slice(1));
        if(p3) return sanitize(p3.slice(0,-1));
        if(p4) return p4.slice(0,p4.length-p4.length%2);
        if(p5) return car;
        if(p6) return "";
    });
}

document.querySelector('#sanitize').addEventListener('click',function(){
    document.querySelector('#output').innerHTML =
        sanitize(document.querySelector('#inputString').value);
});

#inputString{ width:290px }
#sanitize{ background: #009afd; border: 1px solid #1777b7; border:none; color:#fff; cursor:pointer; height: 1.55em; }
#output{ background:#ddd; margin-top:5px; width:295px; }

<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..."/>
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output" ></div>

Some points:

  • The one-pass constraint is not fully respected, because the character captured with \W must itself be sanitized. I did not find any other way.
  • About the ++ rule: any sequence of + is shortened by one + if its length is odd.
  • Apostrophes are only removed if there is a non-alphanumeric character next to them. What do you want to do with, for example, "abc'&"? Should it become "abc&" or "abc'&"? The same question applies to "ab_'s".
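To see what the snippet above does with the two edge cases raised in the last bullet, here is a DOM-free copy of the same sanitize function applied to both strings. The regex and callback logic are taken from the answer; only the layout differs. It is a sketch for experimenting, not a statement of what the desired output should be.

```javascript
// Same regex and callback as the answer's sanitize(), without the DOM wiring.
function sanitize(str) {
  return str.replace(
    /(_)|(\'\W|\'$)|(^\'|\W\')|(\+\+)|([a-zA-Z0-9\ \&\-\.\!\'])|(.)/g,
    function (car, p1, p2, p3, p4, p5, p6) {
      if (p1) return " ";                      // underscore -> space
      if (p2) return sanitize(p2.slice(1));    // drop a trailing apostrophe
      if (p3) return sanitize(p3.slice(0, -1)); // drop a leading apostrophe
      if (p4) return p4.slice(0, p4.length - p4.length % 2);
      if (p5) return car;                      // allowed character, kept as is
      return "";                               // anything else is removed
    }
  );
}

console.log(sanitize("abc'&")); // "abc&"
console.log(sanitize("ab_'s")); // "ab 's"
```

So with this implementation the apostrophe before & is dropped, while the apostrophe after a replaced underscore is kept.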

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#Specifying_a_function_as_a_parameter

Because the replacement you need can differ (nothing or a space), you can't use a fixed replacement string (due to the one-pass constraint). So the only way is to use a dynamic replacement, i.e. a callback function.
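As a minimal illustration of a dynamic replacement (a toy pattern, not the final one): String.prototype.replace accepts a function whose return value is used as the replacement for each match, so a single pass can emit a space for some matches and nothing for others. The name demo and the sample string are made up for this sketch.

```javascript
// One regex, two different replacements, chosen per match by the callback.
// Capture group 1 is only defined when the underscore alternative matched.
function demo(str) {
  return str.replace(/(_)|\+/g, function (match, underscore) {
    return underscore ? " " : "";
  });
}

console.log(demo("a_b+c")); // "a bc"
```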

Direct approach:

Let's try to find the characters to remove, while preserving the others in certain cases:

var str = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";

var re = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g; 

var result = str.replace(re, function (_, g1, g2) {
    if (g1) return g1;        // a run of two or more "+" is kept
    return (g2) ? ' ' : '';   // underscore -> space, anything else is removed
});

console.log(result);

When an underscore is found, capture group 2 is defined (g2 in the callback function) and a space is returned.

Note: in the above example the term "word" is taken in the regex sense (the character class \w, i.e. [a-zA-Z0-9_], except for the underscore). If you want to be more rigorous, for example to exclude single quotes next to digits, you need to change the pattern a little:

var re = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;

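var result = str.replace(re, function (_, g1, g2, g3) {
    if (g2) return g2;        // the non-letter that preceded the quote(s)
    if (g3) return g3;        // a run of two or more "+" is kept
    return (g1) ? ' ' : '';   // underscore -> space, anything else is removed
});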
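The difference between the two patterns only shows up when an apostrophe sits next to a non-letter word character such as a digit. The sketch below runs both on the made-up sample 5's; the helper names clean1 and clean2 are hypothetical, while the regexes and callbacks are the two shown above.

```javascript
// Pattern 1: "word" means \w, so a quote between a digit and a letter is kept.
var re1 = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
function clean1(str) {
  return str.replace(re1, function (m, g1, g2) {
    if (g1) return g1;
    return g2 ? " " : "";
  });
}

// Pattern 2: only letters count, so a quote right after a digit is removed.
var re2 = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;
function clean2(str) {
  return str.replace(re2, function (m, g1, g2, g3) {
    if (g2) return g2;
    if (g3) return g3;
    return g1 ? " " : "";
  });
}

console.log(clean1("5's")); // "5's"
console.log(clean2("5's")); // "5s"
```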

Note about the two patterns:

These two patterns consist of an alternation of 6 or 7 subpatterns, each of which matches about 1 or 2 characters most of the time. Keep in mind that, to find a character to remove, these patterns must test all 6 or 7 alternatives before failing at each character that must not be replaced. That is a significant cost, since most of the time a character doesn't need to be replaced.

There is a way to reduce this cost that you can apply here: first-character discrimination.

The idea is to avoid testing every subpattern as much as possible. This works here because none of the subpatterns begins with a letter, so if you add a lookahead at the beginning you can quickly skip every letter without having to test each subpattern. Example for pattern 2:

var re = /(?=[^a-z])(?:[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+)/gi;

For the first pattern you can skip more characters:

var re = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;
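As a quick sanity check (a sketch; the variable and callback names are made up), the lookahead-prefixed patterns can be compared with the original ones on the sample string. The lookahead only rejects start positions early, so the output should be identical.

```javascript
var str = "abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...";

// Pattern 1 and its prefixed variant (group 1: "++" run, group 2: "_").
var p1 = /[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*/g;
var p1Fast = /(?=[^a-z0-9\s&.!-])(?:[^\w\s&.!'+-]+|\B'+|'+\B|(\+{2,})|\+|'*(_)'*)/gi;
function cb1(m, g1, g2) {
  if (g1) return g1;
  return g2 ? " " : "";
}

// Pattern 2 and its prefixed variant (group 1: "_", group 2: char before quote, group 3: "++" run).
var p2 = /[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+/gi;
var p2Fast = /(?=[^a-z])(?:[^\w\s&.!'+-]+|(_)'*|([^a-z])'+|'+(?![a-z])|(\+{2,})|\+|^'+)/gi;
function cb2(m, g1, g2, g3) {
  if (g2) return g2;
  if (g3) return g3;
  return g1 ? " " : "";
}

console.log(str.replace(p1, cb1) === str.replace(p1Fast, cb1)); // true
console.log(str.replace(p2, cb2) === str.replace(p2Fast, cb2)); // true
console.log(str.replace(p1Fast, cb1)); // "abc's test s! & c++ c Dirty's. and beautiful..."
```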

Despite these improvements, these two patterns need a lot of steps for a small string (~400), but keep in mind that it's an example string containing all the possible cases.

A more indirect approach:

Now let's try another way, which consists of finding a character to replace, but this time together with all the characters before it.

var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi;

var result = str.replace(re, function (_, g1, g2) {
    return g1 + ((g2) ? ' ' : '' );
});

(Note that there is no need to guard against catastrophic backtracking, because (?:a+|b+|c+)* is followed by an always-true subpattern (?:d|e)? . Besides, the whole pattern never fails, whatever the string or the position in it.)

All characters before the character to replace (the allowed content) are captured and returned by the callback function.

This approach needs less than half as many steps to do the same job.
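To make the "allowed content plus one character" idea concrete, the sketch below records what group 1 captures on each call. The helper cleanIndirect and the chunks array are made up for illustration; the regex and the replacement logic are the ones shown above.

```javascript
var re = /((?:[a-z]+(?:'[a-z]+)*|\+{2,}|[\s&.!-]+)*)(?:(_)|.)?/gi;

// Hypothetical helper: returns the cleaned string and the kept chunks.
function cleanIndirect(str) {
  var chunks = [];
  var result = str.replace(re, function (m, g1, g2) {
    if (g1) chunks.push(g1); // allowed content preceding the dropped character
    return g1 + (g2 ? " " : "");
  });
  return { result: result, chunks: chunks };
}

var out = cleanIndirect("abc's, test_s! & c++ c+ 'Dirty's'. and beautiful'...");
console.log(out.result); // "abc's test s! & c++ c Dirty's. and beautiful..."
console.log(out.chunks); // first chunk is "abc's", the text before the comma
```

Note that, as written, the first alternative lists only letters, so a digit falls through to the final . and is removed. Whether digits should be allowed (for example with [a-z0-9]+) depends on the intended rules.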

What you need is chaining and the alternation operator:

function customReplace(str){
   return str.replace(/_/g, " ").replace(/^'|'$|[^a-zA-Z0-9 &-.!]|\+(?=[^+])/g,"");
}

The regex /^'|'$|[^a-zA-Z0-9 &-.!]|\+(?=[^+])/g combines everything that needs to be removed. All _ are replaced by a space first, and the resulting string is returned.

\+(?=[^+]) looks for a + that is followed by any character other than + .

Also, the order of the replace calls is important: the underscores must be turned into spaces first, because the second regex would otherwise delete them.
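One detail worth checking in the character class [^a-zA-Z0-9 &-.!]: a hyphen between two characters inside a class denotes a range, so &-. covers every code point from & (U+0026) to . (U+002E), which includes ' ( ) * + , and - . The sketch below, with made-up variable names, compares it with a class in which the hyphen is escaped.

```javascript
var asWritten = /[^a-zA-Z0-9 &-.!]/; // "&-." is parsed as a range
var escaped = /[^a-zA-Z0-9 &\-.!]/;  // literal "&", "-" and "."

[",", "'", "+", "(", "#"].forEach(function (ch) {
  // true means "this alternative would remove the character"
  console.log(JSON.stringify(ch), asWritten.test(ch), escaped.test(ch));
});
// ","  false true
// "'"  false true
// "+"  false true
// "("  false true
// "#"  true  true
```

With the escaped class, apostrophes and plus signs would also be removed by this alternative, so they would have to be added to the class explicitly if rules 2 and 3 are meant to be handled by the other alternatives.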

Try this regex: /(?!\b)'|'(?=\B)|^'|'$|[^\w\d\s&-.!]|\+(?=[^+])/gm

function sanitize(str) {
    var re = /(?!\b)'|'(?=\B)|^'|'$|[^\w\d\s&-.!]|\+(?=[^+])/gm;
    var subst = '';
    var tmp = str.replace(re, subst);    // remove everything that must go, except (_)
    var result = tmp.replace(/_/g, " "); // then replace every (_) by a space (a string pattern would only replace the first one)
    return result;
}

document.querySelector('#sanitize').addEventListener('click', function() {
    document.querySelector('#output').innerHTML =
        sanitize(document.querySelector('#inputString').value);
});

#inputString { width: 290px }
#sanitize { background: #009afd; border: 1px solid #1777b7; border: none; color: #fff; cursor: pointer; height: 1.55em; }
#output { background: #eee; margin-top: 5px; width: 295px; }

<input id="inputString" type="text" value="abc's test_s! & c++ c+ 'Dirty's'. and beau)'(tiful'..." />
<input id="sanitize" type="button" value="Sanitize it!" />
<div id="output"></div>
