Remove non-ASCII characters from a string

var str="INFO] :谷���新道, ひば���ヶ丘2丁���, ひばりヶ���, 東久留米市 (Higashikurume)";

and I need to remove all non-ASCII characters from the string,

so that str contains only "INFO] (Higashikurume)".

ASCII is in the range 0 to 127, so:

str.replace(/[^\x00-\x7F]/g, "");
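One detail worth spelling out: replace() returns a new string and leaves the original untouched, so the result has to be assigned somewhere. A minimal sketch, where the helper name stripNonAscii and the sample string are assumptions made for illustration:

```javascript
// Hypothetical helper name; the pattern is the one shown in the answer above.
function stripNonAscii(text) {
  // [^\x00-\x7F] matches every UTF-16 code unit outside the ASCII range.
  return text.replace(/[^\x00-\x7F]/g, "");
}

const sample = "INFO] :東久留米市 (Higashikurume)";
const cleaned = stripNonAscii(sample);

console.log(cleaned);            // "INFO] : (Higashikurume)"
console.log(sample === cleaned); // false: the original string is unchanged
```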

It can also be done with a positive assertion of removal, like this:

textContent = textContent.replace(/[\u{0080}-\u{FFFF}]/gu,"");

This uses Unicode. In JavaScript, when expressing Unicode code points in a regular expression, the characters are specified with the escape sequence \u{xxxx}, and the 'u' flag must also be present; note that the regex has the flags 'gu'.
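One consequence of the 'u' flag is that the regex works on whole code points, so a range ending at \u{FFFF} leaves characters above U+FFFF (most emoji, for instance) in place. The sketch below illustrates this with an assumed sample string; extending the range to \u{10FFFF} is a variant, not part of the answer above.

```javascript
// Assumed sample: "a", "é" (U+00E9), an emoji (U+1F600), "b".
const text = "a\u00E9\u{1F600}b";

// Range from the answer: stops at U+FFFF, so the emoji survives.
const upToFFFF = text.replace(/[\u{0080}-\u{FFFF}]/gu, "");

// Variant: extend the range to the last Unicode code point.
const upTo10FFFF = text.replace(/[\u{0080}-\u{10FFFF}]/gu, "");

console.log(upToFFFF === "a\u{1F600}b"); // true
console.log(upTo10FFFF === "ab");        // true
```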

I called this a "positive assertion of removal" in the sense that a "positive" assertion expresses which characters to remove, while a "negative" assertion expresses which characters not to remove. In many contexts, the negative assertion, as stated in the prior answers, might be more suggestive to the reader. The circumflex "^" says "not" and the range \x00-\x7F says "ASCII," so the two together say "not ASCII."

textContent = textContent.replace(/[^\x00-\x7F]/g,"");

That's a great solution for English language speakers who only care about the English language, and it's also a fine answer for the original question. But in a more general context, one cannot always accept the cultural bias of assuming "all non-ASCII is bad." For contexts where non-ASCII is used, but occasionally needs to be stripped out, the positive assertion of Unicode is a better fit.

A good indication that zero-width, non-printing characters are embedded in a string is when the string's "length" property is positive (nonzero), but the string looks like (i.e., prints as) an empty string. For example, I had this showing up in the Chrome debugger, for a variable named "textContent":

> textContent
""
> textContent.length
7

This prompted me to want to see what was in that string.

> encodeURI(textContent)
"%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B%E2%80%8B"

This sequence of bytes seems to be in the family of some Unicode characters that get inserted by word processors into documents, and then find their way into data fields. Most commonly, these symbols occur at the end of a document. The zero-width space "%E2%80%8B" might be inserted by CKEditor.

encodeURI()  UTF-8     Unicode  html     Meaning
-----------  --------  -------  -------  -------------------
"%E2%80%8B"  E2 80 8B  U+200B   &#8203;  zero-width space
"%E2%80%8E"  E2 80 8E  U+200E   &#8206;  left-to-right mark
"%E2%80%8F"  E2 80 8F  U+200F   &#8207;  right-to-left mark
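The inspection and removal steps described above can be sketched as follows. The helper name describeCodePoints and the reconstructed textContent value (seven zero-width spaces, matching the encodeURI() output shown earlier) are assumptions for illustration.

```javascript
// Hypothetical helper: list each code point of a string as "U+XXXX".
function describeCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Assumed reconstruction of the invisible string from the debugger session.
const textContent = "\u200B".repeat(7);

console.log(textContent.length);              // 7
console.log(describeCodePoints(textContent)); // seven entries, each "U+200B"

// Positive assertion: remove only the three characters from the table.
const stripped = "abc\u200Bdef\u200E".replace(/[\u200B\u200E\u200F]/g, "");
console.log(stripped); // "abcdef"
```

Listing code points this way makes invisible characters easy to spot without decoding the percent-encoded UTF-8 by hand.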

Some references on those:

http://www.fileformat.info/info/unicode/char/200B/index.htm

https://en.wikipedia.org/wiki/Left-to-right_mark

Note that although encodeURI() shows the embedded character as UTF-8, the regular expression does not use that encoding. Although the character appears as three bytes (in my case) of UTF-8, the regular expression must refer to it by its Unicode code point, which for these characters fits in two bytes (for example, \u200B). In fact, a UTF-8 sequence can be up to four bytes long; it is less compact than the bare code point because it uses the high bit (or bits) of each byte to mark multi-byte sequences while leaving standard ASCII unchanged. That's explained here:

https://en.wikipedia.org/wiki/UTF-8
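To see the difference between the UTF-8 bytes and what the regular expression matches, the following sketch encodes a single zero-width space. It assumes an environment where TextEncoder is available as a global (current browsers and recent Node.js versions).

```javascript
const zeroWidthSpace = "\u200B";

// UTF-8 view: the bytes that encodeURI() percent-encodes.
const utf8Bytes = new TextEncoder().encode(zeroWidthSpace);
const hex = Array.from(utf8Bytes, (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
).join(" ");

console.log(zeroWidthSpace.length); // 1 (one UTF-16 code unit in the string)
console.log(utf8Bytes.length);      // 3 (three bytes in UTF-8)
console.log(hex);                   // "E2 80 8B"

// The regex refers to the code point, not to the UTF-8 bytes.
console.log(/\u200B/.test(zeroWidthSpace)); // true
```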

You can use the following regex to replace non-ASCII characters:

str = str.replace(/[^A-Za-z 0-9.,?'"!@#$%^&*()\-_=+;:<>\/\\|}{\[\]`~]+/g, '')

However, note that spaces, colons and commas are all valid ASCII, so the result will be:

> str
"INFO] :, , ,  (Higashikurume)"
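If the leftover separators are unwanted, as in the target string from the question, one possible follow-up is to drop them and collapse the remaining whitespace. This is a sketch only: the helper name, the simplified sample string, and the choice to remove every colon and comma are assumptions, and that choice would be too aggressive for text where those characters carry meaning.

```javascript
// Hypothetical helper: strip non-ASCII, then tidy the leftover separators.
function stripAndTidy(text) {
  return text
    .replace(/[^\x00-\x7F]/g, "") // remove non-ASCII characters
    .replace(/[:,]/g, "")         // assumption: colons and commas are noise here
    .replace(/\s+/g, " ")         // collapse runs of whitespace
    .trim();
}

// Simplified sample modelled on the string in the question.
const input = "INFO] :谷新道, ひばりヶ丘, 東久留米市 (Higashikurume)";
console.log(stripAndTidy(input)); // "INFO] (Higashikurume)"
```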

To use ASCII with accents:

var str = str.replace(/[^\x00-\xFF]/g, "");
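A caveat with the \x00-\xFF range: it keeps an accented letter only when the letter is stored as a single precomposed character. If the accent is a separate combining mark, the mark lies outside the range and gets stripped. The sketch below, with assumed sample strings and an assumed helper name, shows the effect and how normalize("NFC") can help beforehand.

```javascript
// Hypothetical helper wrapping the pattern from the answer above.
const keepLatin1 = (text) => text.replace(/[^\x00-\xFF]/g, "");

const precomposed = "caf\u00E9 \u6771\u4EAC"; // "café 東京", é as one character
const decomposed = "cafe\u0301";              // "e" followed by a combining acute accent

console.log(keepLatin1(precomposed));                 // "café "
console.log(keepLatin1(decomposed));                  // "cafe" (accent lost)
console.log(keepLatin1(decomposed.normalize("NFC"))); // "café"
```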

None of these answers properly handle tabs, newlines, and carriage returns, and some don't handle extended ASCII and Unicode. This will KEEP tabs and newlines, but remove control characters and anything outside the ASCII set. There is some new JavaScript coming down the pipe, so in the future (2020+?) you may have to use \u{FFFFF}, but not yet.

 console.log("line 1\nline2 \n\ttabbed\nF̸̡̢͓̳̜̪̟̳̠̻̖͐̂̍̅̔̂͋͂͐l̸̢̹̣̤̙͚̱͓̖̹̻̣͇͗͂̃̈͝a̸̢̡̬͕͕̰̖͍̮̪̬̍̏̎̕͘ͅv̸̢̛̠̟̄̿i̵̮͌̑ǫ̶̖͓͎̝͈̰̹̫͚͓̠̜̓̈́̇̆̑͜ͅ".replace(/[\x00-\x08\x0E-\x1F\x7F-\uFFFF]/g, ''))
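Note that the pattern above also keeps vertical tab (\x0B) and form feed (\x0C), since its first gap runs from \x09 to \x0D. If only tab, line feed, and carriage return should survive alongside printable ASCII, a stricter variant can list what to keep instead. This is a sketch of that variant, with a hypothetical helper name, not the answer's own code.

```javascript
// Hypothetical helper: keep tab (\x09), LF (\x0A), CR (\x0D), and printable
// ASCII (\x20-\x7E); remove everything else.
function keepPrintableAscii(text) {
  return text.replace(/[^\x09\x0A\x0D\x20-\x7E]/g, "");
}

const noisy = "line 1\nline2 \n\ttabbed\x0B\x0C\x00\x7F\u00E9";
const tidy = keepPrintableAscii(noisy);

console.log(JSON.stringify(tidy)); // "line 1\nline2 \n\ttabbed"
```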
