
Regex for a (Twitter-like) hashtag that allows non-ASCII characters

I want a regex that matches a simple hashtag like those on Twitter (e.g. #someword). It should also recognize non-ASCII characters (such as those used in Spanish, Hebrew, or Chinese).

This was my initial regex: (^|\s|\b)(#(\w+))\b
--> but it doesn't recognize non-ASCII characters.
Then I tried XRegExp.js, which worked but ran too slowly.
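To make the failure concrete, here is a small sketch of how the initial pattern behaves. In JavaScript, \w without Unicode support covers only [A-Za-z0-9_], so the match either stops at the first non-ASCII letter or fails outright. The helper name captureTag is hypothetical and exists only for this illustration.

```javascript
// The initial pattern from the question. \w is ASCII-only here.
const initial = /(^|\s|\b)(#(\w+))\b/;

// Hypothetical helper: returns the tag text captured by group 3, or null.
function captureTag(text) {
  const match = text.match(initial);
  return match ? match[3] : null;
}

console.log(captureTag("#someword"));      // "someword"
console.log(captureTag("#ma\u00f1ana"));   // input is "#mañana"; prints "ma"
console.log(captureTag("#\u4e2d\u6587"));  // input is "#中文"; prints null
```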

Any suggestions for how to do it?

Eventually I found twitter-text.js, which is basically how Twitter itself solves this problem.

With native JS regexes that don't support Unicode, your only option is to explicitly enumerate the characters that can end the tag and match everything else, for example:

> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]

The character class [\s.,:,] should include whitespace, punctuation, and whatever else can be considered a terminating symbol.
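The same idea can be extended to collect every hashtag in a string. The sketch below is one possible formulation, not the answer's exact pattern: it uses a negated character class instead of the lazy match plus lookahead, and the list of terminators (whitespace, #, and . , : ; ! ?) is an assumption to adjust for your data. The helper name findHashtags is hypothetical.

```javascript
// Hypothetical helper: returns the body of every hashtag found in `text`.
// Assumption: whitespace, '#', and . , : ; ! ? end a tag; extend the class as needed.
function findHashtags(text) {
  // [^...]+ consumes everything up to the first enumerated terminator.
  const pattern = /#([^\s#.,:;!?]+)/g;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const tags = findHashtags("foo #mañana, bar #中文. baz #שלום");
console.log(tags);
```

Because the class requires at least one non-terminator character, a lone "#" followed by a space produces no match.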

#([^#]+)[\s,;]*

Explanation: This regular expression searches for a # followed by one or more non-# characters, followed by zero or more spaces, commas, or semicolons.

var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);

Result:

["#hasta ", "#mañana ", "#babהַ"]
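Note that the matches above keep their trailing space: [^#]+ is greedy, so it consumes the separators itself before [\s,;]* gets a chance. A minimal sketch of one way to clean this up is to read the capture group and strip trailing separators. The helper name extractTags and the sample input are assumptions made for illustration.

```javascript
// Hypothetical helper: applies the pattern above, then trims trailing
// whitespace, commas, and semicolons from each captured tag.
function extractTags(input) {
  const pattern = /#([^#]+)[\s,;]*/g;
  return Array.from(input.matchAll(pattern), (m) =>
    m[1].replace(/[\s,;]+$/, "")
  );
}

console.log(extractTags("#hasta, #mañana; #adiós"));
```

This only removes separators at the end of each match. Ordinary words between two hashtags, as in "#hasta luego #mañana", would still be captured as part of the first tag.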

EDIT - Replaced the \b that was used as the word boundary
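Both answers work around the ASCII-only \w and \b. As an alternative sketch, assuming a JavaScript engine that implements ES2018 Unicode property escapes (the u flag), the tag characters can be described directly instead of enumerating terminators. The helper name findUnicodeHashtags and the choice of classes (letters, combining marks, digits, underscore) are illustrative assumptions, not Twitter's actual rules.

```javascript
// Assumes ES2018 Unicode property escapes are available (requires the `u` flag).
// \p{L} = letters, \p{M} = combining marks (e.g. Hebrew vowel points), \p{N} = digits.
// Hypothetical helper: returns the body of every hashtag that starts the
// string or follows whitespace.
function findUnicodeHashtags(text) {
  const pattern = /(?:^|\s)#([\p{L}\p{M}\p{N}_]+)/gu;
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

console.log(findUnicodeHashtags("foo #mañana, bar #中文. baz #שלום"));
```

Including \p{M} matters for text such as pointed Hebrew, where vowel marks are separate code points that \p{L} alone would not match.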
