I would like to remove all spaces between Chinese characters only.
My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"
Ideal output: "請把這裡的 10 多個字合併. Can you help me?"
var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\ /", "");
I have studied a similar question for Python but it seems not to work in my situation so I brought my question here for some help.
Getting to the Chinese char matching pattern
Using the Unicode Tools, the \p{Han} Unicode property class, which matches any Chinese char, can be translated into
[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]
In ES6, to match a single Chinese char, it can be used as
/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u
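As a quick check of the `u`-flag pattern above, the sketch below tests a few single characters. The name `hanChar`, the added `^`/`$` anchors, and the sample characters are only for illustration; U+20000 is picked as an example of a character outside the BMP, which is treated as one unit here only because of the `u` flag.

```javascript
// Hypothetical name; the character class is the ES6 one shown above,
// anchored with ^ and $ so that exactly one character must match.
const hanChar = /^[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]$/u;

console.log(hanChar.test('請'));         // true  (BMP ideograph)
console.log(hanChar.test('a'));          // false (ASCII letter)
console.log(hanChar.test('\u{20000}'));  // true  (supplementary-plane ideograph)
```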
Transpiling it to ES5 using the ES2015 Unicode regular expression transpiler, we get
(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])
This is the pattern that matches any Chinese char in a JS RegExp.
So, you may use
s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')
See the regex demo.
If your JS environment is ECMAScript 2018 compliant, you may use a shorter pattern:
s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')
Pattern details
(CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
\s+ - one or more whitespace chars (any Unicode whitespace)
(?=CHINESE_CHAR_PATTERN) - a positive lookahead: there must be a Chinese char immediately to the right of the current location.
JS demo:
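var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
// ES5 pattern for a single Chinese char (backslashes doubled because it is a string)
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]";
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));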
A test for the regex compliant with the ECMAScript 2018 standard:
var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));
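If the same script has to run on engines with and without ES2018 support, one option is to detect support at run time. The following is only a sketch: the helper name `removeHanSpaces` is hypothetical, and the fallback range is a deliberately reduced BMP-only class chosen for brevity, not the full ES5 pattern given earlier.

```javascript
// Hypothetical helper: removes whitespace that sits between two Chinese chars.
function removeHanSpaces(text) {
  var re;
  try {
    // Built from a string so that an older engine fails here, inside the
    // try/catch, instead of throwing a SyntaxError while parsing the script.
    re = new RegExp('(\\p{Script=Hani})\\s+(?=\\p{Script=Hani})', 'gu');
  } catch (e) {
    // Assumption: a reduced range (CJK Unified Ideographs plus Extension A)
    // is an acceptable fallback. It does NOT cover the full Han script.
    re = /([\u3400-\u4DB5\u4E00-\u9FEF])\s+(?=[\u3400-\u4DB5\u4E00-\u9FEF])/g;
  }
  return text.replace(re, '$1');
}

console.log(removeHanSpaces('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// 請把這裡的 10 多個字合併. Can you help me?
```

For full coverage on old engines, the long ES5 pattern can be substituted in the `catch` branch.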
Using @Brett Zamir's solution for matching Chinese characters in a regex, from the question "Javascript unicode string, chinese character but no punctuation":
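const str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
// One Chinese char: BMP ranges, plus surrogate pairs for supplementary-plane chars
const han = '[\\u4E00-\\u9FCC\\u3400-\\u4DB5\\uFA0E\\uFA0F\\uFA11\\uFA13\\uFA14\\uFA1F\\uFA21\\uFA23\\uFA24\\uFA27-\\uFA29]|[\\uD840-\\uD868][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|[\\uD86A-\\uD86C][\\uDC00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D]';
const regex = new RegExp('(' + han + ') (' + han + ')* ', 'g');
const ret = str.replace(regex, '$1$2');
console.log(ret);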
It looks like:
([foo chinese chars]) ([foo chinese chars])*
The range for Chinese characters can be written as [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC], so you can use the regex below. It matches one or more Chinese characters followed by whitespace, and the lookahead (?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+) ensures that a Chinese character follows:
([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)
Then replace each match with $1:
var str = '請 把把把把把 這 裡裡裡裡裡 的 10 多多多多 個 字 合 併. Can you help me?';
console.log(str.replace(/([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)/g, "$1"));
Try this
str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');
This solution works with ASCII characters and Chinese characters in the range \u4E00-\u9FCC (I got the range from here; it contains about 20,000 chars, which is enough for daily usage but does not cover all Chinese characters).
var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');
console.log(str);
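To see why the second alternative `([ -~]+ )` is there, it may help to compare the regex with a version that has only the first alternative. This is an illustrative sketch; the variable names `naive` and `guarded` are not part of the answer.

```javascript
const sample = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';

// First alternative only: every space directly before a Chinese char is
// removed, including the one after "10".
const naive = sample.replace(/ ([\u4E00-\u9FCC])/g, '$1');

// Both alternatives: "([ -~]+ )" consumes each ASCII run together with its
// trailing space and writes it back unchanged through $2, so that space survives.
const guarded = sample.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

console.log(naive);   // 請把這裡的 10多個字合併. Can you help me?
console.log(guarded); // 請把這裡的 10 多個字合併. Can you help me?
```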
Another solution uses the match() method with the Chinese character code range /[\u3400-\u9FBF]/:
str.match(/[\u3400-\u9FBF]/) // detects whether the string contains a Chinese char
My script to remove spaces between Chinese chars:
var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
// Split the text on whitespace:
// spl = ["請", "把", "這", "裡", "的", "10", "多", "個", ...]
var spl = chine.trim().split(/\s+/);
var han = /[\u3400-\u9FBF]/;
var result = '';
for (var i = 0; i < spl.length; i++) {
  // If the current token and the next token both contain a Chinese char,
  // drop the space between them; the i + 1 check guards the last token.
  if (i + 1 < spl.length && han.test(spl[i]) && han.test(spl[i + 1])) result += spl[i];
  else result += spl[i] + ' '; // otherwise keep the space
}
console.log(result.trim());
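The same approach using forEach():
var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var han = /[\u3400-\u9FBF]/;
var result = '';
chine.trim().split(/\s+/).forEach(function (item, i, elm) {
  // elm[i + 1] is undefined for the last token, so check that it exists first
  if (elm[i + 1] !== undefined && han.test(item) && han.test(elm[i + 1])) result += item;
  else result += item + ' ';
});
console.log(result.trim());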
This might be useful in your scenario:
(?<![ -~]) (?![ -~])
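A minimal usage sketch for this pattern, assuming an engine that supports lookbehind (ES2018). The variable names are illustrative. Note that the pattern does not test for Chinese characters as such: it removes any space whose neighbours are both outside the printable ASCII range, so other non-ASCII text would be merged too.

```javascript
const text = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';

// Remove a space only when neither neighbour is a printable ASCII char
// (the range " " to "~"). Spaces next to "10" or "." are therefore kept.
const merged = text.replace(/(?<![ -~]) (?![ -~])/g, '');

console.log(merged); // 請把這裡的 10 多個字合併. Can you help me?
```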