
How to filter out characters that aren't letters, numbers or punctuation

I have a string that contains a lot of formatting characters such as bullet points and arrows. I want to clean this string so that it only contains letters, numbers and punctuation. Multiple spaces should also be replaced by a single space.

Allowed punctuation: , . : ; [ ] ( ) / \ ! @ # $ % ^ & * + - _ { } < > = ? ~ | "

Basically anything allowed in this ASCII table.

This is what I have so far:

let asciiOnly = y.replace(/[^a-zA-Z0-9\s]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')

Regex101: https://regex101.com/r/0DC1tz/2

I also tried the [:punct:] class, but apparently it's not supported by JavaScript. Is there a better way to clean this string than a regex? A library, maybe (I didn't find any)? If not, how would I do this with a regex? Would I have to edit the first regex to add every single punctuation character?

EDIT: I tried to paste an example string into the question, but SO removes characters it doesn't recognize, so it looks like a normal string. Here's a paste.

EDIT2: I think this is what I needed:

let asciiOnly = x.replace(/[^\x20-\x7E]+/gm, '')
let withoutSpacing = asciiOnly.replace(/\s{2,}/gm, ' ')

I'm testing it with different cases to make sure.

You can achieve this with the regex below. It matches any run of characters outside printable ASCII (this also covers non-printable ASCII control characters and extended ASCII), so each match can be replaced with an empty string.

[^ -~]+

This assumes you want to retain only printable ASCII characters, which range from space (ASCII value 32) to tilde ~ (ASCII value 126), hence the character set [^ -~] (equivalently [^ !-~]).

Then replace every run of one or more whitespace characters with a single space:

var str = `Determine the values of P∞ and E∞ for each of the following signals: bdf Periodic and aperiodic signals Determine whether or not each of the following signals is periodic: b. Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period. bd Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals: bcdef Figure 1: Problem Set 1.4 Even and Odd Signals For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero. bd -------------------------`;

// Remove everything outside printable ASCII, then collapse whitespace runs.
console.log(str.replace(/[^ -~]+/g, '').replace(/\s+/g, ' '));
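The two replacements above can be wrapped in one small helper. This is only a sketch: the function name `toPrintableAscii` and the sample input are assumptions made for illustration, not part of the answer.

```javascript
// Hypothetical helper (the name is an assumption): keep only printable ASCII,
// then collapse whitespace runs into a single space.
function toPrintableAscii(input) {
  return input
    .replace(/[^ -~]+/g, '') // drop everything outside space (0x20) .. tilde (0x7E)
    .replace(/\s+/g, ' ');   // collapse runs of whitespace to one space
}

console.log(toPrintableAscii('P∞ and E∞:  • first → second'));
// prints "P and E: first second"
```

Note that tab (0x09) and newline (0x0A) also fall outside the space-to-tilde range, so the first replacement deletes them. Words separated only by a line break end up joined, which may or may not be what you want.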

Also, if you only want to allow alphanumeric characters and the special characters you listed, you can use this regex to match everything else:

[^ a-zA-Z0-9,.:;[\]()/\\!@#$%^&*+_{}<>=?~|"-]+

The doubled backslash keeps a literal backslash in the allowed set, and the hyphen is placed last so that it is read literally. Replace each match with an empty string, and then replace one or more whitespace characters with a single space.

var str = `Determine the values of P∞ and E∞ for each of the following signals: bdf Periodic and aperiodic signals Determine whether or not each of the following signals is periodic: b. Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period. bd Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals: bcdef Figure 1: Problem Set 1.4 Even and Odd Signals For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero. bd -------------------------`;

// Remove everything that is not explicitly allowed, then collapse whitespace runs.
console.log(str.replace(/[^ a-zA-Z0-9,.:;[\]()/\\!@#$%^&*+_{}<>=?~|"-]+/g, '').replace(/\s+/g, ' '));
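If hand-editing the character class feels error-prone, one option is to build it from a plain list of the allowed marks. The sketch below assumes the punctuation list from the question. The names `ALLOWED_PUNCTUATION`, `escapeForCharClass`, `DISALLOWED` and `keepAllowed` are hypothetical, chosen only for this example.

```javascript
// Punctuation from the question's allow-list. String.raw keeps the backslash literal.
const ALLOWED_PUNCTUATION = String.raw`,.:;[]()/\!@#$%^&*+-_{}<>=?~|"`;

// Escape the characters that are special inside a character class: \ ] ^ -
function escapeForCharClass(chars) {
  return chars.replace(/[\\\]^-]/g, '\\$&');
}

// Matches runs of anything that is not a space, ASCII letter, digit or allowed mark.
const DISALLOWED = new RegExp(
  `[^ a-zA-Z0-9${escapeForCharClass(ALLOWED_PUNCTUATION)}]+`,
  'g'
);

// Hypothetical helper: strip disallowed characters, then collapse whitespace runs.
function keepAllowed(input) {
  return input.replace(DISALLOWED, '').replace(/\s+/g, ' ');
}

console.log(keepAllowed('a\\b °C [x] → done'));
// prints "a\b C [x] done"
```

Adding or removing a mark then means editing only the list, and the escaping is handled in one place.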

This is how I would do it: remove all the disallowed characters first, and then replace multiple spaces with a single space.

let str = `Determine the values of P∞ and E∞ for each of the following signals: bdf Periodic and aperiodic signals Determine whether or not each of the following signals is periodic:!!!23 b. Determine whether or not each of the following signals is periodic. If a signal is periodic, specify its fundamental period. bd Transformation of Independent variables A continuous-time signal x(t) is shown in Figure 1. Sketch and label carefully each of the following signals: bcdef Figure 1: Problem Set 1.4 Even and Odd Signals For each signal given below, determine all the values of the independent variable at which the even part of the signal is guaranteed to be zero. bd ------------------------- `

// \w covers letters, digits and underscore; \\ keeps a literal backslash.
const op = str.replace(/[^\w,.:;\[\]()/\\!@#$%^&*+{}<>=?~|" -]/g, '').replace(/\s+/g, " ")
console.log(op)

EDIT: If you want to keep \n or \t as they are, use (\s)\1+ with the replacement "$1" in the second regex.
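A short sketch of what that second regex does on its own. The function name `collapseRepeatedWhitespace` is an assumption for illustration. The backreference `\1` only matches a repeat of the same whitespace character, so a run of newlines collapses to one newline rather than to a space.

```javascript
// Hypothetical helper: collapse runs of the SAME whitespace character,
// keeping one copy of that character instead of substituting a space.
function collapseRepeatedWhitespace(input) {
  return input.replace(/(\s)\1+/g, '$1');
}

console.log(JSON.stringify(collapseRepeatedWhitespace('a   b\n\n\nc\t\td')));
// prints "a b\nc\td"

console.log(JSON.stringify(collapseRepeatedWhitespace('a \n b')));
// prints "a \n b" (a mixed run is left alone because no character repeats)
```

For this to matter, the first regex would also need to allow \n and \t. As written, its character class only keeps the plain space, so line breaks and tabs are already gone before the second replacement runs.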

  • There probably isn't a better solution than a regex. The under-the-hood implementation of regex operations is usually well optimized by virtue of age and ubiquity.
  • You may be able to explicitly tell the regex handler to "compile" the regex. This is usually a good idea if you know the regex is going to be used a lot within a program, and it may help with performance here. But I don't know whether JavaScript exposes such an option.
  • The idea of "normal punctuation" doesn't have a solid foundation. Some common marks, such as the degree sign in "90°", aren't ASCII, and some ASCII characters, such as DEL ( &#127; ), are ones you almost certainly don't want. I would expect you to find similar edge cases with any pre-made list. In any case, explicitly listing all the punctuation you want to allow is better in general, because then no one will ever have to look up what's in the list you chose.
  • You may be able to perform both substitutions in a single pass, but it's unclear whether that will perform better, and it almost certainly won't be clearer to any co-workers (including yourself-from-the-future). There will be a lot of finicky details to work out, such as whether " ° " should be replaced with "", " ", or "  ".
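To make the last point concrete, here is a hedged comparison of the two-pass approach with one possible single-pass variant. The function names `twoPass` and `singlePass` are assumptions, and the single-pass regex is just one of several ways the details could be settled.

```javascript
// Two passes, as in the answers above: delete non-printable-ASCII runs,
// then collapse whitespace runs.
function twoPass(input) {
  return input.replace(/[^ -~]+/g, '').replace(/\s+/g, ' ');
}

// One pass (an assumed variant): treat spaces and disallowed characters alike,
// and turn every such run into a single space.
function singlePass(input) {
  return input.replace(/[^!-~]+/g, ' ');
}

console.log(twoPass('a ° b'), '|', singlePass('a ° b'));
// prints "a b | a b"

console.log(twoPass('90°C'), '|', singlePass('90°C'));
// prints "90C | 90 C"
```

The two agree when the removed character is surrounded by spaces, but differ when it sits between two kept characters: the two-pass version joins them, while this single-pass version inserts a space. Which behaviour is right depends on the data.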
