
Remove last occurrence of invisible Unicode character using regex?

I have the string "my string \u{E0000}", which ends with the invisible character \u{E0000}. How can I use a regex to remove this character so that splitting the string with .split(' ') gives a length of 2 rather than the 3 it shows right now?

This is the regex I am currently using to remove the character; however, when I split the string, the length is still 3 and not 2. The split should look like ['my', 'string'].

.replace(/[\u034f\u2800(\u{E0000})\u180e\u2000-\u200d]/gu, '');

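The invisible character you have there is a single code point (U+E0000), but it takes two UTF-16 code units (the surrogate pair \uDB40\uDC00), which is why it adds 2 to the string's length. With the u flag, \u{e0000} matches the whole character; the \u{dc00} I add after it in the code below only catches a stray low surrogate.
However, you also seem to be misunderstanding the way split works. If the string ends with a space, split(' ') still produces an extra, empty element after that space. See the example below, where the last call has no special character after the space: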

    // removing the special character so the length of the string is 10 with "my string "
    console.log(
      "my string \u{E0000}".length,
      "my string \u{E0000}".replace(/[\u034f\u2800(\u{e0000}\u{dc00})\u180e\u2000-\u200d]/gu, '').length
    );

    console.log(
      // use trim to remove the trailing space so that it behaves the way you want
      "my string \u{E0000}".replace(/[\u034f\u2800(\u{e0000}\u{dc00})\u180e\u2000-\u200d]/gu, '')
        .trim().split(' ')
    );

    // notice that it still tries to split the final space into a 3rd element.
    console.log(
      // \x20 is the hex code for space
      ("my string" + "\x20").split(' ')
    );
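Putting the two points together, here is a minimal sketch of one way to get the two-element result. The helper name splitVisibleWords is hypothetical, and the sketch assumes U+E0000 is the only invisible character you need to strip; note that it removes every occurrence, not just the last one.

```javascript
// Hypothetical helper (not from the question): strip U+E0000, trim, then split.
function splitVisibleWords(text) {
  return text
    .replace(/\u{E0000}/gu, '') // the u flag lets \u{...} match the whole code point
    .trim()                     // drop the trailing space left behind
    .split(/\s+/);              // split on runs of whitespace
}

const words = splitVisibleWords('my string \u{E0000}');
console.log(words, words.length);
```

Splitting on /\s+/ instead of ' ' is a design choice here: it avoids empty elements when two spaces end up next to each other after the removal.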

Note that you may need to adjust your regex. I haven't checked, but it is highly likely that the Unicode escapes you are using are not correct and do not take multi-codepoint characters into account.
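As a quick way to check how a given character is stored, the sketch below compares UTF-16 code units with code points for U+E0000. The variable name invisible is only for illustration.

```javascript
const invisible = '\u{E0000}';

// length counts UTF-16 code units; spreading iterates by code point.
console.log(invisible.length);      // 2
console.log([...invisible].length); // 1

// The two code units are a surrogate pair; codePointAt(0) decodes the pair.
console.log(invisible.charCodeAt(0).toString(16)); // db40
console.log(invisible.charCodeAt(1).toString(16)); // dc00
console.log(invisible.codePointAt(0).toString(16)); // e0000

// With the u flag "." matches a whole code point; without it, one code unit.
console.log(/^.$/u.test(invisible)); // true
console.log(/^.$/.test(invisible));  // false
```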

I've created a function below for extracting full escape sequences.

    // some code point values stop the iterator; use length instead
    var codePoints = (char, pos, end) =>
      Array(char.length).fill(0).map((_, i) => char.codePointAt(i)).slice(pos || 0, end)

    var escapeSequence = (codes, pos, end) =>
      codePoints(codes, pos, end).map(p => `\\u{${p.toString(16)}}`).join('')

    document.getElementById('btn').onclick = () => {
      const text = document.getElementById('text').value
      const start = +document.getElementById('start').value
      const end = document.getElementById('end').value || undefined
      document.getElementById('result').innerHTML = escapeSequence(text, start, end)
    }

    console.log(escapeSequence('1️⃣'))
    console.log(escapeSequence("\u{E0000}"))
    console.log(escapeSequence("my string \u{E0000}", 10))

    <label for="text">unicode text: </label><input type="text" id="text"><br>
    <label for="start">start position to retrieve from: </label><input type="number" id="start"><br>
    <label for="end">end position to retrieve from: </label><input type="number" id="end"><br>
    <button id="btn">get unicode escaped code points</button><br>
    <div id="result"></div>
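The function above walks the string index by index, so a character outside the Basic Multilingual Plane such as U+E0000 is reported twice: once as the full code point and once as its trailing surrogate (\u{dc00}). A possible variant, sketched here under the hypothetical name escapeByCodePoint, iterates by code point instead, so each character yields exactly one escape. It leaves out the pos and end arguments for brevity.

```javascript
// Hypothetical variant: Array.from iterates a string by code point, not by index.
const escapeByCodePoint = (text) =>
  Array.from(text, (ch) => `\\u{${ch.codePointAt(0).toString(16)}}`).join('');

console.log(escapeByCodePoint('\u{E0000}'));     // \u{e0000}
console.log(escapeByCodePoint('1\uFE0F\u20E3')); // \u{31}\u{fe0f}\u{20e3}
```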
