
Remove all spaces between Chinese words with regex

I would like to remove all spaces among Chinese text only.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\ /", "");

I have studied a similar question for Python but it seems not to work in my situation so I brought my question here for some help.

Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class that matches any Chinese char can be translated into

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]

In ES6, a single Chinese char can be matched with the u flag and \u{...} code point escapes:

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u

Transpiling it to ES5 using the ES2015 Unicode regular expression transpiler, we get

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])

This is the pattern that matches any Chinese char with an ES5 JS RegExp.
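The surrogate ranges in the transpiled pattern come from how UTF-16 encodes code points above U+FFFF. The sketch below shows that arithmetic for two boundaries of the ranges listed above; `toSurrogatePair` and `hex` are hypothetical helper names introduced only for this illustration.

```javascript
// Hypothetical helper: split an astral code point (above 0xFFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: format a number as upper-case hex.
function hex(n) {
  return n.toString(16).toUpperCase();
}

console.log(toSurrogatePair(0x20000).map(hex).join(' ')); // D840 DC00
console.log(toSurrogatePair(0x2A6D6).map(hex).join(' ')); // D869 DED6
```

This is why the range \u{20000}-\u{2A6D6} turns into a full [\uD840-\uD868][\uDC00-\uDFFF] alternative plus a partial \uD869[\uDC00-\uDED6...] alternative.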

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')

See the regex demo.

If your JS environment is ECMAScript 2018 compliant you may use a shorter

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')

Pattern details

  • (CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
  • \s+ - one or more whitespace characters (any Unicode whitespace)
  • (?=CHINESE_CHAR_PATTERN) - a positive lookahead: there must be a Chinese char immediately to the right of the current location.
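The three parts described above can be wrapped in a small function to see how they behave on a few inputs. This is a minimal sketch that assumes an ECMAScript 2018 engine (for \p{Script=Hani}); `removeHanSpaces` is a hypothetical name, not part of any library.

```javascript
// Hypothetical helper: drop whitespace that sits between two Han characters.
// Assumes Unicode property escapes are supported (ES2018).
function removeHanSpaces(text) {
  // Group 1 keeps the left-hand character; the lookahead checks the
  // right-hand character without consuming it, so it can start the next match.
  return text.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1');
}

console.log(removeHanSpaces('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// 請把這裡的 10 多個字合併. Can you help me?
console.log(removeHanSpaces('漢字 abc 漢 字')); // 漢字 abc 漢字
console.log(removeHanSpaces('Hello world'));   // Hello world
```

Because the right-hand character is only looked at, not consumed, a run such as "請 把 這" collapses in a single pass.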

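JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
// Matches one Chinese char; astral code points are written as surrogate pairs.
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]";
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));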

A test for the regex compliant with the ECMAScript 2018 standard:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));
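If the code has to run both in ES2018 engines and in older ones, the two variants can be chosen at runtime. The following is only a sketch under that assumption: `makeHanSpaceRemover` is a hypothetical name, and the fallback deliberately covers just the BMP ranges from the pattern above, so rare astral characters would be missed in old engines.

```javascript
// Hypothetical factory: picks the short ES2018 pattern when available,
// otherwise a BMP-only fallback (an assumption made for brevity).
function makeHanSpaceRemover() {
  var re;
  try {
    // Throws a SyntaxError where \p{...} is not supported.
    re = new RegExp('(\\p{Script=Hani})\\s+(?=\\p{Script=Hani})', 'gu');
  } catch (e) {
    var han = '[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007' +
      '\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF' +
      '\\uF900-\\uFA6D\\uFA70-\\uFAD9]';
    re = new RegExp('(' + han + ')\\s+(?=' + han + ')', 'g');
  }
  return function (text) {
    return text.replace(re, '$1');
  };
}

var removeSpaces = makeHanSpaceRemover();
console.log(removeSpaces('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// 請把這裡的 10 多個字合併. Can you help me?
```

Building the regex with the RegExp constructor matters here: a regex literal using \p{...} would fail at parse time in an old engine, before the try block could catch anything.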

Using @Brett Zamir's solution for matching Chinese characters in a regex, from the question

Javascript unicode string, chinese character but no punctuation


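const str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
// One Chinese character: BMP ranges plus surrogate pairs for the astral ranges.
const han = '[\\u4E00-\\u9FCC\\u3400-\\u4DB5\\uFA0E\\uFA0F\\uFA11\\uFA13\\uFA14\\uFA1F\\uFA21\\uFA23\\uFA24\\uFA27-\\uFA29]|[\\uD840-\\uD868][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|[\\uD86A-\\uD86C][\\uDC00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D]';
const regex = new RegExp('(' + han + ') (' + han + ')* ', 'g');
const ret = str.replace(regex, '$1$2');
console.log(ret);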


The pattern has the form:

([foo chinese chars]) ([foo chinese chars])*

The range for Chinese characters can be written as [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC], so you can use this regex, which selects Chinese characters followed by whitespace and ensures that a Chinese character comes next by means of the lookahead (?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+):

([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)

Then replace each match with $1.

Demo

var str = '請 把把把把把 這 裡裡裡裡裡 的 10 多多多多 個 字 合 併. Can you help me?';
console.log(str.replace(/([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)/g, "$1"));

Try this

str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

This solution works with ASCII characters and Chinese characters in the range \u4E00-\u9FCC (I took the range from here; it covers about 20,000 characters, which is enough for daily usage, but not all Chinese characters).

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');
console.log(str);
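The caveat about the range not covering every Chinese character can be made concrete. In the sketch below, U+3400 (from CJK Extension A) lies outside \u4E00-\u9FCC, so the space in front of it survives; widening the class with \u3400-\u4DBF is one possible adjustment, shown here as an assumption rather than as part of the answer above. The function names `narrow` and `wide` are hypothetical.

```javascript
// The range from the answer above: only U+4E00 to U+9FCC.
function narrow(text) {
  return text.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');
}

// Assumed variant: the same pattern with CJK Extension A added.
function wide(text) {
  return text.replace(/ ([\u3400-\u4DBF\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');
}

var sample = '\u4F60 \u3400 \u597D'; // U+4F60, space, U+3400, space, U+597D

console.log(narrow(sample) === '\u4F60 \u3400\u597D'); // true: one space is left
console.log(wide(sample) === '\u4F60\u3400\u597D');    // true: both spaces removed
```

The second alternative, ([ -~]+ ), consumes runs of printable ASCII together with their trailing space and writes them back unchanged, which is what protects the spaces around "10" and inside the English sentence.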

Another solution uses the match() method with the Chinese character range /[\u3400-\u9FBF]/ (more details):

str.match(/[\u3400-\u9FBF]/) // non-null when str contains a Chinese character

My script to remove spaces between Chinese characters:

var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var isHan = /[\u3400-\u9FBF]/;
// Split the text on whitespace: ["請", "把", "這", "裡", "的", "10", "多", "個", ...]
var spl = chine.trim().split(/\s+/);
var result = '';
for (var i = 0; i < spl.length; i++) {
  var next = spl[i + 1]; // undefined for the last token
  // Drop the space only when this token and the next one both contain a Chinese character.
  if (next !== undefined && isHan.test(spl[i]) && isHan.test(next)) result += spl[i];
  // Otherwise keep a single space, except after the last token.
  else result += spl[i] + (next !== undefined ? ' ' : '');
}
console.log(result);

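  • Using map() instead of a for loop:

var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var isHan = /[\u3400-\u9FBF]/;
var result = chine.trim().split(/\s+/).map(function (item, i, arr) {
  var next = arr[i + 1]; // undefined for the last token
  if (next === undefined) return item;
  // Keep the space unless both neighbours contain a Chinese character.
  return isHan.test(item) && isHan.test(next) ? item : item + ' ';
}).join('');
console.log(result);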

This might be useful in your scenario. (?<![ -~]) (?![ -~])
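A minimal sketch of how that pattern could be applied: it removes a space only when neither neighbour is a printable ASCII character. It assumes an engine with lookbehind support (ES2018), and the function name `removeNonAsciiSpaces` is hypothetical. Note that the test is "not ASCII" rather than "is Chinese", so spaces between other non-ASCII characters are removed as well.

```javascript
// Hypothetical helper built on the pattern above.
// (?<![ -~]) : the space is not preceded by printable ASCII
// (?![ -~])  : the space is not followed by printable ASCII
function removeNonAsciiSpaces(text) {
  return text.replace(/(?<![ -~]) (?![ -~])/g, '');
}

console.log(removeNonAsciiSpaces('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// 請把這裡的 10 多個字合併. Can you help me?

// Side effect of testing for "not ASCII": accented Latin text is affected too.
console.log(removeNonAsciiSpaces('caf\u00E9 \u00E9t\u00E9') === 'caf\u00E9\u00E9t\u00E9'); // true
```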
