JavaScript regular expression for punctuation (international)?

Question

I need a regular expression to match against all punctuation marks, such as the standard [,!@#$%^&*()], but including international marks like the upside-down Spanish question mark, Chinese periods, etc. My google-fu is coming up short. Does anyone have such a regular expression on hand that's compatible with JavaScript?

Answer 1

Adding to @stema's answer (https://stackoverflow.com/a/7578937/114140)... here is the regex as a string (so you don't need to bloat your project with XRegExp).

!-#%-\x2A,-/:;\x3F@\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65

I used this in my own project with some additions...

    // any kind of punctuation character (including international e.g. Chinese and Spanish punctuation)
    // author: http://www.regular-expressions.info/unicode.html
    // source: https://github.com/slevithan/xregexp/blob/41f4cd3fc0a8540c3c71969a0f81d1f00e9056a9/src/addons/unicode/unicode-categories.js#L142
    // note: XRegExp unicode output taken from http://jsbin.com/uFiNeDOn/3/edit?js,console (see chrome console.log), then converted back to JS escaped unicode here http://rishida.net/tools/conversion/, then tested on http://regexpal.com/
    // suggested by: https://stackoverflow.com/a/7578937
    // added: extra characters like "$", "\uFFE5" [yen symbol], "^", "+", "=" which are not considered punctuation in the XRegExp regex (they are currency or mathematical characters)
    // added: \u3000-\u303F Chinese Punctuation for good measure
    var regex_characters_to_remove = /[\$\uFFE5\^\+=`~<>{}\[\]|\u3000-\u303F!-#%-\x2A,-/:;\x3F@\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/g
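One pitfall when using the bare class string with the RegExp constructor: in an ordinary string literal, an escape such as \x5B-\x5D is turned into the characters [-] before the regex engine sees it, which ends the character class early. A minimal sketch of one way around that, assuming a deliberately shortened subset of the list above (not the full class) and the hypothetical names punctuationClass, punctuationRun and stripPunctuation:

```javascript
// Shortened, illustrative subset of the class above, NOT the full list.
// String.raw keeps the backslashes, so \x5B-\x5D reaches the regex engine
// as an escape instead of becoming a literal "[-]" that would end the class.
const punctuationClass = String.raw`!-#%-\x2A,-\/:;\x3F@\x5B-\x5D_\x7B}\u00A1\u00BF\u3001-\u3003\uFF01\uFF0C\uFF1F`;

// Wrap the class body in brackets and match runs of one or more characters.
const punctuationRun = new RegExp("[" + punctuationClass + "]+", "g");

// Remove every run of matched punctuation from the text.
function stripPunctuation(text) {
  return text.replace(punctuationRun, "");
}

console.log(stripPunctuation("¡Hola, mundo!")); // "Hola mundo"
console.log(stripPunctuation("你好，世界。"));    // "你好世界"
```

Pasting the same class body between /[ and ]/ in a regex literal, as in the snippet above, avoids the string-escaping problem altogether.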

Answer 2

If it's possible for you to use a plugin, there is one for JavaScript: the XRegExp Unicode plugins. They add support for Unicode categories, scripts, and blocks (I personally have only read about it, I never used it).

With this plugin it should be possible to use Unicode categories like \p{P}, as explained at regular-expressions.info.

Update: OK, I tested it, and it seems to work fine.

You need to get the library from XRegExp and additionally the Unicode Base and Unicode Category plugins (linked above).

<script src="xregexp.js"></script>
<script src="addons/unicode-base.js"></script>
<script src="addons/unicode-categories.js"></script>
<script>
    var unicodePunctuation = XRegExp("^\\p{P}+$");

    alert(unicodePunctuation.test("?.,;!¡¿。、·")); // true
</script>

The above alerts true. I included some Spanish and Chinese punctuation in my test string, "?.,;!¡¿。、·".

Answer 3

Hmm... not sure how extensive it would be, but you could use this:

[^\w\s\n\t]
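A quick check of what this class actually matches may help when judging how broad it is. In JavaScript, \w covers only ASCII letters, digits and the underscore, so the negated class accepts international punctuation but also accented and CJK letters. The variable name below is only for illustration:

```javascript
// Anything that is neither an ASCII word character nor whitespace.
const nonWordNonSpace = /[^\w\s\n\t]/;

console.log(nonWordNonSpace.test("¿"));  // true: Spanish punctuation
console.log(nonWordNonSpace.test("。")); // true: Chinese full stop
console.log(nonWordNonSpace.test("a"));  // false: ASCII letter
console.log(nonWordNonSpace.test(" "));  // false: whitespace
console.log(nonWordNonSpace.test("é"));  // true: accented letter is not in \w
console.log(nonWordNonSpace.test("中")); // true: CJK letter is not in \w
```

So this pattern is fine for ASCII-only text, but on international text it treats letters as if they were punctuation.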

Answer 4

Your regex would look something like...

/[,!@#$%^&*()\u9999]/

Where you replace each \u9999 with the Unicode code point for the other punctuation characters.

If you can find a bunch of them in a contiguous range, you can specify that with the - range operator, e.g. \u9990-\u9999.

As far as I know, you can't use something like \pP in JavaScript regexes.
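To make the placeholder approach concrete, here is a small sketch with a few real code points filled in. The selection is hypothetical and chosen only for illustration: \u00A1 (¡), \u00BF (¿), and the range \u3001-\u3003 (、 。 〃). A usable class would need many more entries.

```javascript
// ASCII punctuation from the answer plus a few hand-picked code points
// and one range. This is an illustrative selection, not a complete list.
const somePunctuation = /[,!@#$%^&*()\u00A1\u00BF\u3001-\u3003]/;

console.log(somePunctuation.test("¿"));  // true: listed individually
console.log(somePunctuation.test("。")); // true: U+3002 falls inside the range
console.log(somePunctuation.test("a"));  // false
console.log(somePunctuation.test("?"));  // false: not in this particular list
```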

Answer 5

For Python, this regex removes any type of punctuation mark from the start and end of a string:

import re
def cleanspecialcharacters(str):
    # Strip any run of punctuation from the start and from the end of the string.
    regex = re.compile((
    '^[/\"_\(\)&*\$￥\^\+=`~<>\{\}\[\]\|\-!#%\,\:;@¡§«¶·»¿;·՚-՟։֊؉،॥॰෴๏๚๛༄-༒༔༺-༽྅჻፠-፨᐀᙭᙮។-៖៘-៚‧‰-⁃⁅-⁑⁓-⁞⁽⁾₍₎、〃〈-【】〔-〟〰〽゠・﴾﴿︐-︙︰-﹒﹔-﹡﹣﹨﹪﹫！-＃％-＊，-／：；？＠［-］＿｛｝｟-･〔〕《》]*|'
    '([/\"_\(\)&*\$￥\^\+=`~<>\{\}\[\]\|\-!#%\,\:;@¡§«¶·»¿;·՚-՟։֊؉،॥॰෴๏๚๛༄-༒༔༺-༽྅჻፠-፨᐀᙭᙮។-៖៘-៚‧‰-⁃⁅-⁑⁓-⁞⁽⁾₍₎、〃〈-【】〔-〟〰〽゠・﴾﴿︐-︙︰-﹒﹔-﹡﹣﹨﹪﹫！-＃％-＊，-／：；？＠［-］＿｛｝｟-･〔〕《》])*$'))
    str = regex.sub('', str)
    return str
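Since the question asks about JavaScript, a rough counterpart of the Python function can be sketched there too. Note that this swaps in a different technique: instead of the explicit character list, it uses the Unicode property escapes \p{P} (punctuation) and \p{S} (symbols) available from ES2018, so the set of matched characters is close to, but not identical to, the Python class. The function name trimPunctuation is hypothetical.

```javascript
// Runs of punctuation or symbols anchored at the start or at the end.
// The u flag is required for \p{...}; g lets both ends be replaced.
const edgePunctuation = /^[\p{P}\p{S}]+|[\p{P}\p{S}]+$/gu;

// Strip leading and trailing punctuation, leaving the interior untouched.
function trimPunctuation(text) {
  return text.replace(edgePunctuation, "");
}

console.log(trimPunctuation("¡¡Hola, mundo!!")); // "Hola, mundo"
console.log(trimPunctuation("«cita»"));          // "cita"
console.log(trimPunctuation("$$Hola, mundo!!")); // "Hola, mundo"
```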

Answer 6

From ES2018, Unicode property escapes are supported. You can use \p{Punctuation} or just \p{P} (the same as the XRegExp answer) to match any punctuation character (by the Unicode definition), or \P{Punctuation} to match any non-punctuation character.
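A short sketch of these escapes in use, reusing the test string from the XRegExp answer. The variable names are only for illustration, and the u flag on each regex is required for \p{...} to be recognized:

```javascript
// True only if the whole string consists of punctuation characters.
const allPunctuation = /^\p{P}+$/u;
// Matches a single character that is NOT punctuation.
const nonPunctuation = /\P{P}/u;

console.log(allPunctuation.test("?.,;!¡¿。、·")); // true
console.log(allPunctuation.test("$"));            // false: "$" is a currency symbol, not punctuation
console.log(nonPunctuation.test("a"));            // true
console.log("¡Hola, mundo!".replace(/\p{P}+/gu, "")); // "Hola mundo"
```

As the second check shows, the Unicode definition of punctuation leaves out currency and math symbols such as "$" and "+"; \p{S} can be added to the class if those should match as well.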

If you want to match any "non-word" character, like a Unicode version of \W, you can try something like:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

(as recommended in the proposal for the feature). You might want to remove \p{Connector_Punctuation}, since that includes underscores and similar.

Don't forget to add the u flag to your regular expression to make it Unicode-aware and enable this feature.
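Putting the class and the u flag together, a minimal sketch might look like this. The two variable names are hypothetical; the second pattern is the variant with \p{Connector_Punctuation} removed:

```javascript
// Unicode-aware "non-word" character, following the class quoted above.
const nonWord = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u;
// Same class without Connector_Punctuation, so "_" counts as non-word too.
const nonWordOrUnderscore = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Join_Control}]/u;

console.log(nonWord.test("é"));  // false: accented letters are word characters here
console.log(nonWord.test("中")); // false
console.log(nonWord.test("¿"));  // true
console.log(nonWord.test("_"));  // false: underscore is connector punctuation
console.log(nonWordOrUnderscore.test("_")); // true
```

Without the u flag, \p{...} is not treated as a property escape, so the pattern will not behave as intended.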

6 answers

Answer 1
9 2014-01-28 03:57:14

Answer 2
8 Accepted 2011-09-28 05:58:52

Answer 3
3 2011-09-28 00:03:32

Answer 4
2 2011-09-28 00:04:51

Answer 5
0 2015-03-24 06:59:56

Answer 6
0 2020-09-18 08:29:29
