Removing non-printable character

Okay, so I've been bashing my head against the table over this one.

I am importing an XML file that was exported by InDesign. My code parses it and creates a file based on the input. (I'm building a JS application with Node.)

This file looks good in my PHPStorm IDE. But when I open it in gedit, I see some unwanted newlines here and there.

I've managed to track it down to this character: -> <- (it really is there: copy it somewhere and move your cursor over it using the arrow keys. It's stuck in the middle).

Viewed in a hex editor, this character turns out to be 0x80 0xE2 0xA9.

When I tried to replace it using a simple JavaScript replace:

data = data.replace(' ', ''); //There IS a character in the left one. Trust me.

I got a parse error (the screenshot is not preserved here).

In vim, the following shows up at that spot: ~@

How am I going to remove that from my output? Escaping the character in the JS code caused it to compile just fine, but then the weird character is still there. I'm out of ideas.

You need to use '\u2029' as the search string. The sequence you are trying to replace is a "paragraph separator" Unicode character inserted by InDesign.

So:

string.replace('\u2029', '');

instead of the character itself.

String.replace() doesn't work exactly the way you think. The way you use it, it'll only replace the first occurrence:

> "abc abc abc".replace("a", "x");
'xbc abc abc'

You need to add the g (global) flag, and the long-standing standard way to do that is to use a regular expression as the match (engines that support ES2021 also provide String.prototype.replaceAll):

> "abc abc abc".replace(/a/g, "x");
'xbc xbc xbc'

You can have a look at Fastest method to replace all instances of a character in a string for further ideas.
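Putting the two answers together, a minimal sketch could look like the following. The helper name `stripParagraphSeparators` and the sample string are made up for illustration; the only parts taken from the answers are the `\u2029` escape and the `g` flag.

```javascript
// Hypothetical helper (the name is an assumption, not from the original post).
// It removes every U+2029 PARAGRAPH SEPARATOR from a string.
function stripParagraphSeparators(text) {
  // The \u2029 escape avoids putting the raw character in the source file,
  // and the g flag makes replace() act on every match, not just the first.
  return text.replace(/\u2029/g, '');
}

// Made-up sample input containing two paragraph separators.
const sample = 'first\u2029second\u2029third';
console.log(stripParagraphSeparators(sample)); // prints "firstsecondthird"
```

If the separators should become ordinary line breaks instead of disappearing, the second argument could be '\n' rather than the empty string.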


Looking up 0x80 0xE2 0xA9 as UTF-8 shows that no such character exists, but it is probably a mistyping of 0xE2 0x80 0xA9, which corresponds to 'PARAGRAPH SEPARATOR' (U+2029), as Goran points out in his answer. You don't normally need to encode exotic characters as JavaScript \u#### escapes as long as your whole toolset is properly configured to use UTF-8. In this case, though, the JavaScript engine treats the character as a line terminator and raises a syntax error, because unescaped line terminators are not allowed inside JavaScript string literals (engines implementing ES2019 or later do accept a raw U+2028 or U+2029 there, so the error is specific to older engines).
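The byte-order point can be checked in Node with the global `Buffer` class. This is only a quick sanity check, and the variable names are chosen here for illustration.

```javascript
// U+2029 written as an escape, so the source file stays free of the raw character.
const separator = '\u2029';

// Encode to UTF-8 and show the bytes as hex.
const bytes = Buffer.from(separator, 'utf8');
console.log(bytes.toString('hex')); // prints "e280a9"

// Decoding E2 80 A9 gives back the paragraph separator.
const roundTrip = Buffer.from([0xe2, 0x80, 0xa9]).toString('utf8');
console.log(roundTrip === separator); // prints true

// The order quoted in the question (80 E2 A9) is not valid UTF-8,
// so decoding yields U+FFFD replacement characters instead.
const asQuoted = Buffer.from([0x80, 0xe2, 0xa9]).toString('utf8');
console.log(asQuoted.includes('\uFFFD')); // prints true
```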
