
Regex remove all leading and trailing special characters?

Let's say I have the following string in JavaScript:

&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&

I want to remove all the leading and trailing special characters (anything that is not alphanumeric or a letter from another language's alphabet) from every word.

So the string should look like

a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c

Notice that the special characters between the alphanumeric characters are kept. The ê in the last word is also kept.

This regex should do what you want. It looks for

  • start of line, or some spaces (^| +) captured in group 1
  • some number of symbol characters [!-\/:-@\[-`\{-~]*
  • a minimal number of non-space characters ([^ ]*?) captured in group 2
  • some number of symbol characters [!-\/:-@\[-`\{-~]*
  • followed by a space or end-of-line (using a positive lookahead) (?=\s|$)

Matches are replaced with just groups 1 and 2 (the spacing and the characters between the symbols).

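    let str = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
    // Groups 1 and 2 keep the spacing and the word; the symbol runs around the word are dropped.
    str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
    console.log(str); // a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c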
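To make the mapping between the five bullets and the pattern easier to follow, the same expression can be assembled from named pieces. This is only a sketch: the names PUNCT, leading, body, boundary and trimPunctuation are illustrative and do not come from the answer.

```javascript
// ASCII punctuation ranges used in the answer: ! to /, : to @, [ to `, { to ~
const PUNCT = '[!-\\/:-@\\[-`\\{-~]';

const leading = '(^| +)';      // group 1: start of line, or some spaces
const body = '([^ ]*?)';       // group 2: shortest possible run of non-space characters
const boundary = '(?=\\s|$)';  // lookahead: whitespace or end of input

// symbols* + body + symbols*, wrapped by the spacing group and the lookahead
const trimPattern = new RegExp(leading + PUNCT + '*' + body + PUNCT + '*' + boundary, 'g');

// Hypothetical helper: keeps only the spacing (group 1) and the word (group 2).
function trimPunctuation(text) {
  return text.replace(trimPattern, '$1$2');
}

console.log(trimPunctuation('&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&'));
// a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c
```

Because the punctuation class is ASCII-only, any non-ASCII character such as ê is treated as part of the word.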

Note that if you want to preserve a run of punctuation characters that stands on its own (e.g. the & in Apple & Sauce), you need two changes: make the second capture group require one or more non-space characters ( ([^ ]+?) ) instead of zero or more, and add a negative lookahead after the leading punctuation match to assert that the next character is not punctuation:

    let str = 'Apple &&& Sauce; -This + !That!';
    // (?!...) makes sure group 2 starts on a non-punctuation character.
    str = str.replace(/(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/gi, '$1$2');
    console.log(str); // Apple &&& Sauce This + That
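Running both variants on the same input shows what the extra lookahead and the + quantifier change. This is a small comparison sketch; the names stripAll and keepLone are made up for illustration, and JSON.stringify is used only to make the doubled spaces visible.

```javascript
// Variant 1: group 2 may be empty, so a word made only of punctuation disappears.
const stripAll = /(^| +)[!-\/:-@\[-`\{-~]*([^ ]*?)[!-\/:-@\[-`\{-~]*(?=\s|$)/g;

// Variant 2: group 2 must start with a non-punctuation character,
// so a word made only of punctuation never matches and is left alone.
const keepLone = /(^| +)[!-\/:-@\[-`\{-~]*(?![!-\/:-@\[-`\{-~])([^ ]+?)[!-\/:-@\[-`\{-~]*(?=\s|$)/g;

const sample = 'Apple &&& Sauce; -This + !That!';

console.log(JSON.stringify(sample.replace(stripAll, '$1$2')));
// "Apple  Sauce This  That"
console.log(JSON.stringify(sample.replace(keepLone, '$1$2')));
// "Apple &&& Sauce This + That"
```

Note that the first variant leaves the spaces around a removed run in place, so the result contains double spaces where &&& and + used to be.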

a-zA-Z\u00C0-\u017F is used to match all valid characters, including letters with diacritics. Digits are not part of this set; add 0-9 to each character class if they should count as valid too.

The following is a single regular expression that captures each individual word. It skips any leading invalid characters, starts the capture group at the first valid character, and ends it just before the final run of invalid characters that precedes a whitespace character or the end of the string.

    const myRegEx = /[^a-zA-Z\u00C0-\u017F]*([a-zA-Z\u00C0-\u017F].*?[a-zA-Z\u00C0-\u017F]*)[^a-zA-Z\u00C0-\u017F]*?(\s|$)/g;
    // $1 is the trimmed word, $2 is the whitespace (or nothing at the end of the string).
    let myString = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&'.replace(myRegEx, '$1$2');
    console.log(myString); // a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c
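The range U+00C0 to U+017F only covers the Latin-1 Supplement and Latin Extended-A blocks, so words in other scripts (Cyrillic, Greek, CJK and so on) would be treated as invalid. A possible alternative, not taken from the answer, is to use Unicode property escapes together with lookarounds. The sketch below assumes an engine with ES2018 support (the u flag, \p{...} and lookbehind), and assumes that "valid" means any Unicode letter or number; EDGE_SYMBOLS and trimWordEdges are hypothetical names.

```javascript
// Assumption: a valid character is any Unicode letter (\p{L}) or number (\p{N}).
// First alternative: a run of symbols at the start of a word
//   ((?<!\S) = not preceded by a non-whitespace character).
// Second alternative: a run of symbols at the end of a word
//   ((?!\S) = not followed by a non-whitespace character).
const EDGE_SYMBOLS = /(?<!\S)[^\p{L}\p{N}\s]+|[^\p{L}\p{N}\s]+(?!\S)/gu;

// Hypothetical helper: deletes the symbol runs at both edges of every word.
function trimWordEdges(text) {
  return text.replace(EDGE_SYMBOLS, '');
}

console.log(trimWordEdges('&a.b.c. .&a.b.&&dc.& &ê.b..c&'));
// a.b.c a.b.&&dc ê.b..c
console.log(trimWordEdges('«привет» (мир) #2024!'));
// привет мир 2024
```

Unlike the pattern above, this version deletes the unwanted runs directly instead of rebuilding each word from capture groups, and it counts digits as valid.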

Something like this might help:

    const string = '&a.b.c. &a.b.c& .&a.b.c.&. *;a.b.c&*. a.b&.c& .&a.b.&&dc.& &ê.b..c&';
    const result = string
      .split(' ')                                                          // one entry per word
      .map(s => /^[^a-zA-Z0-9ê]*([\w\W]*?)[^a-zA-Z0-9ê]*$/g.exec(s)[1])  // keep only group 1
      .join(' ');
    console.log(result); // a.b.c a.b.c a.b.c a.b.c a.b&.c a.b.&&dc ê.b..c

Note that this is not a single regex; it relies on some JavaScript helper code.

Rough explanation: We first split the string on spaces into an array of substrings, then strip the leading and trailing special characters from each substring.

The leading and trailing special characters are matched with [^a-zA-Z0-9ê]* . Because of the ^ at the start of the brackets, the class matches every character except those listed, i.e. all special characters. Between these two classes, the relevant characters are captured with ([\w\W]*?) . \w matches word characters and \W matches non-word characters, so [\w\W] matches any character at all. Appending ? to * makes the quantifier lazy, so the group captures as little as possible and leaves any trailing special characters to the class that follows it. The regex starts with ^ and ends with $ , which anchor it to the start and the end of the string, so it has to match the entire substring.

With .exec(s)[1] we run the regex on the substring and return the first capturing group from the transform function. Note that this is an empty string if the substring contains no valid characters, because every part of the pattern is allowed to match nothing. At the end we join the substrings with spaces.
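The split, strip and join steps can also be wrapped in a small reusable function. The sketch below is one possible variation rather than the answer's own code: it strips each word with two anchored alternatives in a single replace call instead of exec, and it additionally drops words that consisted only of special characters. The names stripEdges and cleanWords are hypothetical, and the character class is copied from the answer.

```javascript
// Hypothetical helper: removes non-valid characters from both ends of one word.
// Valid characters follow the answer's class: a-z, A-Z, 0-9 and ê.
function stripEdges(word) {
  return word.replace(/^[^a-zA-Z0-9ê]+|[^a-zA-Z0-9ê]+$/g, '');
}

// Hypothetical helper: applies stripEdges to every space-separated word.
function cleanWords(text) {
  return text
    .split(' ')               // 1. split on spaces
    .map(stripEdges)          // 2. strip symbols from both ends of each word
    .filter(w => w !== '')    // 3. drop words that contained only symbols
    .join(' ');               // 4. join the remaining words with single spaces
}

console.log(cleanWords('&a.b.c. &&& .&a.b.&&dc.& &ê.b..c&'));
// a.b.c a.b.&&dc ê.b..c
```

Filtering out the empty entries is a design choice: without step 3, a word such as &&& would leave two adjacent spaces in the result.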

