Regex splits words at accented Latin characters

Question

I'm working on an HTML tool for studying the ancient Latin language. In one exercise the student has to click on individual words inside a div that contains a passage of Latin:

<div class="clickable">
                   Cum a Romanis copiis vincĭtur măr, Gallia terra fera est. 
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant. 
Etiam a_femĭnis vita agrestis agĭtur, 
miseras vestes induunt et cum familiā in parvis casis vivunt. 
Vita secūra nimiaeque divitiae a Gallis contemnuntur. 
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur. 
Galli densis silvis defenduntur, tamen Roma feram Galliam capit. 
</div>

In my JavaScript I wrap every word in a <span> using a regex and then attach some actions:

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // Wrap each run of \w characters in a span.
    var myText = oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');

    return myText;
}).click(function(event) {
    // Ignore clicks that do not land on a wrapped word.
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

The problem is that words containing ĭ, ŏ, or ā are not wrapped as a whole; they are split at these characters.

How can I match this class of words correctly?

JS Fiddle

Answer 1

You can split the text on delimiters instead. In the common case these are whitespace or punctuation marks:

(.+?)([\s,.!?;:)([\]]+)

https://regex101.com/r/xW4pF1/5

Edit

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // $1 is the word, $2 the run of delimiters that follows it.
    var myText = oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');

    return myText;
}).click(function(event) {
    // Ignore clicks that do not land on a wrapped word.
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

https://jsfiddle.net/s568c0pp/3/
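To try the delimiter approach without jQuery or a page, here is a minimal sketch that applies the same pattern to a plain string. The function name `wrapByDelimiter` and the sample sentence are only illustrative, and the input is assumed to be plain text with no leading whitespace.

```javascript
// Same pattern as above: $1 is the word, $2 the delimiters after it.
const DELIMITED_WORD = /(.+?)([\s,.!?;:)([\]]+)/g;

// Hypothetical helper: wraps each delimited word in a span.
function wrapByDelimiter(text) {
  return text.replace(DELIMITED_WORD, '<span class="word">$1</span>$2');
}

const sample = "vincĭtur măr, Gallia terra fera est. ";
console.log(wrapByDelimiter(sample));

// A word with no delimiter after it is not matched at all.
console.log(wrapByDelimiter("est"));
```

Because the pattern requires at least one delimiter after each word, a final word with nothing after it stays unwrapped, and leading whitespace ends up inside the first span. Trimming the text and making sure it ends in a delimiter avoids both.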

Answer 2

The \w metacharacter matches a word character: a-z, A-Z, 0-9, and the _ (underscore) character. It does not match accented letters such as ĭ, so you need to change your regex to use a range of Unicode characters instead of \w.
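A quick sketch of both halves of that point: how the original pattern splits a word, and what an explicit range could look like. The range `\u00C0-\u024F` is my own choice for illustration, not something from the question. It covers the accented letters of Latin-1 Supplement plus Latin Extended-A and Extended-B, but it also lets through the two non-letters × (U+00D7) and ÷ (U+00F7).

```javascript
const word = "vinc\u012Dtur"; // "vincĭtur", with precomposed ĭ (U+012D)

// \w is ASCII-only in JavaScript, so ĭ counts as a non-word character
// and \b sees a word boundary on both sides of it.
const asciiPieces = word.match(/\b(\w+?)\b/g);
console.log(asciiPieces); // [ 'vinc', 'tur' ]

// Hypothetical replacement class: ASCII letters plus U+00C0..U+024F.
const RANGE_WORD = /[A-Za-z\u00C0-\u024F]+/g;
const rangePieces = word.match(RANGE_WORD);
console.log(rangePieces); // [ 'vincĭtur' ]
```

Note that this only works when the accented letters are stored as single precomposed code points. A base letter followed by a combining breve or macron would still be split.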

You can also try \p{L} instead of \w to match any Unicode letter. In JavaScript, \p{L} is only recognized when the regex carries the u flag (ES2018 and later).
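As a rough sketch of the `\p{L}` route in JavaScript, assuming an engine with Unicode property escapes (the `u` flag) and plain-text input without nested tags. The helper name `wrapWords` is hypothetical, and adding `\p{M}` is my own addition so that letters written as a base character plus a combining mark stay in one word.

```javascript
// Runs of Unicode letters (\p{L}) and combining marks (\p{M}).
// The u flag is required for \p{...} to be recognized.
const UNICODE_WORD = /[\p{L}\p{M}]+/gu;

// Hypothetical helper: wraps each run of letters in a span.
// $& in the replacement string stands for the whole match.
function wrapWords(text) {
  return text.replace(UNICODE_WORD, '<span class="word">$&</span>');
}

console.log(wrapWords("vinc\u012Dtur m\u0103r, Gallia"));

// Decomposed form: "i" followed by a combining breve (U+0306).
console.log(wrapWords("vinci\u0306tur"));
```

Unlike \w, this class does not include digits or the underscore, so a token such as `a_femĭnis` is wrapped as two separate words around the `_`.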

See also: http://www.regular-expressions.info/unicode.html


Answer 1: score 4, accepted, 2016-04-04 07:09:49
Answer 2: score 1, 2016-04-04 07:08:26
