Regular expression to split words with accented Latin characters

I'm working on an HTML tool for studying ancient Latin. In one exercise the student has to click on individual words inside a div that contains a passage of Latin:

<div class="clickable">
                   Cum a Romanis copiis vincĭtur măr, Gallia terra fera est. 
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant. 
Etiam a_femĭnis vita agrestis agĭtur, 
miseras vestes induunt et cum familiā in parvis casis vivunt. 
Vita secūra nimiaeque divitiae a Gallis contemnuntur. 
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur. 
Galli densis silvis defenduntur, tamen Roma feram Galliam capit. 
</div>    

In my JavaScript I wrap every word in a <span> using a regex, and then attach some actions:

var words = $('div.clickable');
words.html(function (index, oldHtml) {
    // Wrap each word in a span; \w only matches ASCII word characters
    return oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');
}).click(function (event) {
    // Ignore clicks that do not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

The problem is that words containing ĭ, ŏ, or ā are not wrapped correctly: they are split at those characters.

How can I match this class of words correctly?

JS Fiddle

You can split your text on dividers. In the common case a divider is whitespace or a punctuation mark:

(.+?)([\s,.!?;:)([\]]+)

https://regex101.com/r/xW4pF1/5

Edit

var words = $('div.clickable');
words.html(function (index, oldHtml) {
    // $1 is the word, $2 is the run of dividers that follows it
    return oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');
}).click(function (event) {
    // Ignore clicks that do not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

https://jsfiddle.net/s568c0pp/3/
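As a DOM-free illustration of the divider idea, here is a minimal sketch of a variant that matches runs of non-divider characters instead of "text followed by dividers". It is not the regex from the answer: the helper name `wrapWords` is hypothetical, the divider set is assumed to be the same one as above, and the input is assumed to be plain text without nested HTML tags. The negated class avoids two edge cases of the lazy pattern: leading whitespace being wrapped, and a final word with no trailing divider being skipped.

```javascript
// Hypothetical helper: wraps every run of non-divider characters in a span.
// The divider set (whitespace , . ! ? ; : ( ) [ ]) mirrors the answer's regex.
// Assumes plain text input; HTML tags inside the text are not handled.
function wrapWords(text) {
  // $& inserts the whole matched run back into the replacement
  return text.replace(/[^\s,.!?;:()\[\]]+/g, '<span class="word">$&</span>');
}

// \u012D is ĭ and \u0103 is ă, written as escapes to avoid encoding issues
const sample = '  vinc\u012Dtur m\u0103r, Gallia';
console.log(wrapWords(sample));
```

With jQuery, the same function could be passed the old markup inside the `.html(function (index, oldHtml) { ... })` callback shown above.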

The \w metacharacter matches a word character: a-z, A-Z, 0-9, and _ (underscore). Accented letters such as ĭ are not included, so you need to change your regex to use a range of Unicode characters instead of \w.
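A minimal sketch of both halves of that explanation follows. The first part reproduces the split; the second uses an explicit range. The range itself is an assumption, not something the answer specifies: it extends \w with the letters of the Latin-1 Supplement and Latin Extended-A/B blocks, which cover ĭ, ŏ, ā and similar precomposed characters. The names `LATIN_WORD` and `wrapLatinWords` are hypothetical.

```javascript
// \w is ASCII-only, so \b reports a word boundary on each side of ĭ (\u012D)
const broken = 'vinc\u012Dtur'.match(/\b(\w+?)\b/g);
console.log(broken); // two pieces: "vinc" and "tur"

// Assumed range: \w (ASCII letters, digits, underscore) plus U+00C0 to U+024F,
// skipping the multiplication sign U+00D7 and the division sign U+00F7.
const LATIN_WORD = /[\w\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+/g;

// Hypothetical helper: wraps each matched word in a span
function wrapLatinWords(text) {
  return text.replace(LATIN_WORD, '<span class="word">$&</span>');
}

// The underscore is still a word character, so "a_femĭnis" stays one unit
console.log(wrapLatinWords('a_fem\u012Dnis vita'));
```

This range only helps with precomposed characters. If the text stores an accent as a base letter followed by a combining mark, calling `text.normalize('NFC')` first may be needed.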

You can also try \p{L} instead of \w to match any Unicode letter. In JavaScript, \p{L} is only recognized in regexes that use the u flag (ES2018 and later).
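The \p{L} suggestion could look roughly like the sketch below, assuming an engine that supports Unicode property escapes. The names `UNICODE_WORD` and `wrapUnicodeWords` are hypothetical, and adding \p{M}, \p{N}, and the underscore is a choice made here to stay close to what \w accepted, not part of the original suggestion.

```javascript
// Needs the `u` flag; without it, \p{L} is not treated as a property escape.
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers; `_` mirrors \w.
const UNICODE_WORD = /[\p{L}\p{M}\p{N}_]+/gu;

// Hypothetical helper: wraps each matched word in a span
function wrapUnicodeWords(text) {
  return text.replace(UNICODE_WORD, '<span class="word">$&</span>');
}

// \u016B is ū and \u012D is ĭ
console.log(wrapUnicodeWords('Vita sec\u016Bra, a_fem\u012Dnis'));

// A base letter followed by a combining breve (\u0306) also stays in one span
console.log(wrapUnicodeWords('vinci\u0306tur'));
```

Compared with a hand-written range, this does not depend on knowing in advance which accented letters appear in the text.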

See also: http://www.regular-expressions.info/unicode.html
