Regular expression to split words with accented characters in Latin
I'm working on an HTML tool for studying ancient Latin. In one exercise the student has to click on individual words inside a div that contains a passage of Latin:
<div class="clickable">
Cum a Romanis copiis vincĭtur măr, Gallia terra fera est.
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant.
Etiam a_femĭnis vita agrestis agĭtur,
miseras vestes induunt et cum familiā in parvis casis vivunt.
Vita secūra nimiaeque divitiae a Gallis contemnuntur.
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur.
Galli densis silvis defenduntur, tamen Roma feram Galliam capit.
</div>
In my JavaScript I wrap every single word in a <span> using a regex, and then apply some actions:
var words = $('div.clickable');
words.html(function (index, oldHtml) {
  // Wrap every word in a span so it can be clicked individually
  var myText = oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');
  return myText;
}).click(function (event) {
  // Ignore clicks that do not land on a wrapped word
  if (!$(event.target).hasClass("word")) return;
  alert($(event.target).text());
});
The problem is that words containing ĭ, ŏ, ā are not wrapped correctly: they are split at these characters.
How can I match this class of words correctly?
You can split your text on delimiters. In the common case these are whitespace and punctuation marks:
(.+?)([\s,.!?;:)([\]]+)
https://regex101.com/r/xW4pF1/5
Edit
var words = $('div.clickable');
words.html(function (index, oldHtml) {
  // $1 is the word, $2 is the run of delimiters that follows it
  var myText = oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');
  return myText;
}).click(function (event) {
  // Ignore clicks that do not land on a wrapped word
  if (!$(event.target).hasClass("word")) return;
  alert($(event.target).text());
});
https://jsfiddle.net/s568c0pp/3/
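One caveat of the delimiter pattern above is that it only wraps a word when a delimiter follows it, so a final word with nothing after it is left unwrapped. A possible variant, sketched here under the assumption that the input is plain text without nested tags, matches runs of non-delimiter characters instead. The function name wrapByDelimiters is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper, not from the original answer.
// Matches every run of characters that are NOT whitespace or one of the
// punctuation marks used as delimiters in the answer's pattern.
function wrapByDelimiters(text) {
  return text.replace(/[^\s,.!?;:()[\]]+/g, '<span class="word">$&</span>');
}

// The answer's pattern, for comparison: it needs a trailing delimiter.
var answerPattern = /(.+?)([\s,.!?;:)([\]]+)/g;

var sample = 'Galli\u0101 capit'; // "Galliā capit", no trailing punctuation

console.log(sample.replace(answerPattern, '<span class="word">$1</span>$2'));
// <span class="word">Galliā</span> capit

console.log(wrapByDelimiters(sample));
// <span class="word">Galliā</span> <span class="word">capit</span>
```

Inside the div from the question every word is followed by whitespace or punctuation, so both patterns behave the same there; the difference only shows up on text that ends in a word.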
The \w metacharacter matches a word character: a-z, A-Z, 0-9, and the _ (underscore) character. Accented letters such as ĭ, ŏ, ā are outside that set, so \b sees a word boundary next to them and the word is cut there. You therefore need to change your regex to use a range of Unicode characters instead of \w.
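As a rough illustration of the range approach, the sketch below replaces \w with an explicit character class. The range U+00C0 to U+024F (Latin-1 Supplement letters plus Latin Extended-A and B) is an assumption about which accented letters the Latin texts use, and the name wrapWords is hypothetical. It also assumes the accented letters are precomposed characters; text stored in decomposed form could first be passed through String.prototype.normalize('NFC').

```javascript
// Hypothetical helper, not from the original answer.
// ASCII word characters plus U+00C0-U+024F, which covers ĭ, ŏ, ā, ū, ă.
// Caveat: the multiplication and division signs (U+00D7, U+00F7) also sit
// inside this range and would be treated as word characters.
function wrapWords(text) {
  return text.replace(/[A-Za-z0-9_\u00C0-\u024F]+/g, '<span class="word">$&</span>');
}

// \w is ASCII-only in JavaScript, so it does not match ĭ (U+012D).
console.log(/\w/.test('\u012D')); // false

// The question's pattern therefore cuts the word at the accented letter.
console.log('vinc\u012Dtur'.replace(/\b(\w+?)\b/g, '[$1]')); // [vinc]ĭ[tur]

// With the explicit range the word stays in one piece.
console.log(wrapWords('vinc\u012Dtur m\u0103r'));
// <span class="word">vincĭtur</span> <span class="word">măr</span>
```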
You can also try \p{L} instead of \w to match any Unicode letter. In JavaScript, \p{L} is only recognized when the regex has the u flag (ES2018 and later).
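A minimal sketch of the \p{L} variant in JavaScript follows. It assumes an engine with Unicode property escapes (ES2018 or later); the name wrapWordsUnicode is hypothetical, and adding \p{M}, \p{N} and the underscore to the class is a choice made here so that combining marks, digits and joined forms like a_femĭnis stay inside one word.

```javascript
// Hypothetical helper, not from the original answer.
// \p{L} = any letter, \p{M} = combining marks, \p{N} = any number.
// The u flag is required for \p{...} to be recognized.
function wrapWordsUnicode(text) {
  return text.replace(/[\p{L}\p{M}\p{N}_]+/gu, '<span class="word">$&</span>');
}

console.log(wrapWordsUnicode('a_fem\u012Dnis vita'));
// <span class="word">a_femĭnis</span> <span class="word">vita</span>

// Decomposed input (u followed by combining macron U+0304) also stays whole,
// because the combining mark is matched by \p{M}.
var decomposed = 'secu\u0304ra';
console.log(wrapWordsUnicode(decomposed) === '<span class="word">' + decomposed + '</span>'); // true
```

Compared with a hand-written range, this does not depend on knowing in advance which accented letters appear in the texts.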
See also: http://www.regular-expressions.info/unicode.html