JavaScript Regex to Match Boundaries of Words with Diacritics

I need to find, in a text document, the word boundaries of words that contain diacritics. Given a word token, my regex looks like this:

var wordRegex = new RegExp("\\b(" + token + ")\\b", "g");
while ((match = wordRegex.exec(text)) !== null) {
    // accept only a match that lies after the last recorded occurrence of this token
    if (match.index > (seen.get(token) || -1)) {
        var wordStart = match.index;
        var wordEnd = wordStart + token.length - 1;
        item.characterOffsetBegin = wordStart;
        item.characterOffsetEnd = wordEnd;

        seen.set(token, wordEnd);
        break;
    }
}

This works fine for ordinary words like ciao, casa, etc. But it does not work when the text contains words like però, così, etc.

const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
tokens.forEach((token, tokenIndex) => {
    var item = {
        "index": (tokenIndex + 1),
        "word": token
    }
    var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
    var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
    var match = null;
    console.log(token, "---->", wordRegex)
    while ((match = wordRegex.exec(text)) !== null) {
        console.log("\t---->", match.index)
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
})

As you can see, some words (like macchine or nascoste) match, so I get the match.index; for other words (like però) the regex does not match and the match variable is null:

macchine ----> /\b(macchine)\b/g
    ----> 7
nascoste ----> /\b(nascoste)\b/g
    ----> 16
e, ----> /\b(e\,)\b/g
però, ----> /\b(però\,)\b/g
nascoste ----> /\b(nascoste)\b/g
    ----> 16
    ----> 34
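A likely explanation for the missing matches, added here as a minimal sketch rather than something stated in the thread: in JavaScript `\b` is defined in terms of `\w`, and `\w` only covers `[A-Za-z0-9_]`. An accented letter therefore counts as a non-word character. The names `pero`, `isAsciiWordChar` and `wholeWord` are illustrative only.

```javascript
// "però" written with an escape so the precomposed "ò" (U+00F2) is unambiguous.
const pero = "per\u00f2";

// \w, and therefore \b, only treats [A-Za-z0-9_] as word characters.
const isAsciiWordChar = (ch) => /^\w$/.test(ch);
console.log(isAsciiWordChar("o"));      // true
console.log(isAsciiWordChar("\u00f2")); // false

// No boundary exists between "ò" and ",": both are non-word characters.
const wholeWord = new RegExp("\\b" + pero + "\\b");
console.log(wholeWord.test("e, " + pero + ", nascoste")); // false

// A boundary is found inside the word instead, between "r" and "ò".
console.log(/\bper\b/.test(pero)); // true
```

The same reasoning applies to a token such as `però,`: the comma is followed by a space, and two non-word characters in a row never produce a `\b` boundary.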

How can I write a boundary regex that also supports diacritics?
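One possible approach, sketched here under assumptions and not taken from the thread, is to replace `\b` with lookarounds built on Unicode property escapes. It needs an engine that supports the `u` flag, `\p{…}` and lookbehind (ES2018). The helper names `escapeRegExp`, `WORD_CHAR` and `findWordOffsets` are hypothetical, and "word character" is assumed to mean any letter, combining mark, digit or underscore.

```javascript
// Escape only regex syntax characters; under the "u" flag, escaping
// anything else (for example "," or "#") would be a syntax error.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed definition of a word character: letters, marks, digits, underscore.
const WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

// Returns the inclusive begin/end offsets of every whole-word occurrence.
function findWordOffsets(text, word) {
  const pattern =
    "(?<!" + WORD_CHAR + ")" + escapeRegExp(word) + "(?!" + WORD_CHAR + ")";
  const regex = new RegExp(pattern, "gu");
  const offsets = [];
  for (const match of text.matchAll(regex)) {
    offsets.push({
      begin: match.index,
      end: match.index + match[0].length - 1,
    });
  }
  return offsets;
}
```

With the sample sentence from the question, `findWordOffsets(text, "però")` would report a single occurrence starting at offset 28, while `findWordOffsets(text, "per")` would report none, because `per` is followed by a letter.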

[UPDATE] Following the approach suggested in the comments, I now remove the diacritics from each word token before building the regex, and from the whole text as well:

var normalizedText = removeDiacritics(text);
// for each token...
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
escaped = removeDiacritics(escaped);
var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
var match = null;
while ((match = wordRegex.exec(normalizedText)) !== null) {
    //...
}

This time the words with accents are matched by the \b word boundaries. Of course this approach is not optimal, because removeDiacritics must be applied to every token; the best solution would be to do this only once.
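A minimal sketch of doing the removal once, under stated assumptions: the text is stripped a single time and then tokenized from the stripped copy, so the tokens are already free of diacritics. `stripDiacritics` is a hypothetical stand-in for `removeDiacritics`, built on `String.prototype.normalize` and not on the library used in the thread. It assumes the input uses precomposed characters, so that each accented letter becomes exactly one plain letter and the offsets stay aligned with the original text.

```javascript
// Decompose accented letters (NFD), then drop the combining marks.
// Assumption: fine for Italian text; other scripts may need more care.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// "per\u00f2" is "però" with a precomposed "ò".
const original = "Ci son macchine nascoste e, per\u00f2, nascoste male";

// Strip once, then tokenize the stripped text with a plain ASCII class.
const normalizedText = stripDiacritics(original);
const tokens = normalizedText.split(/[^a-zA-Z0-9]+/).filter(Boolean);

console.log(tokens.includes("pero"));            // true
console.log(/\bpero\b/.test(normalizedText));    // true
console.log(normalizedText.indexOf("pero"));     // 28
```

Because the stripped text has the same length as the original here, an offset found in `normalizedText` can be used directly against the original string.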

This is the solution we came up with in the comments to map words with diacritics to their index in the text:

function removeDiacritics(text) {
    return _.deburr(text)
}
const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
var normalizedText = removeDiacritics(text)
tokens.forEach((token, tokenIndex) => {
    var item = {
        "index": (tokenIndex + 1),
        "word": removeDiacritics(token)
    }
    var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
    escaped = removeDiacritics(escaped)
    var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
    var match = null;
    console.log(token, "---->", wordRegex)
    while ((match = wordRegex.exec(normalizedText)) !== null) {
        console.log("\t---->", match.index)
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
})
 <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.11/lodash.min.js"></script>
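As an alternative sketch, not the approach from the comments, the offsets can be collected while tokenizing, which avoids searching the text again for each token and needs no `seen` map. The function name `tokenizeWithOffsets` is hypothetical, and a word is assumed to be a run of Unicode letters, combining marks and digits. It relies on `String.prototype.matchAll` and the `u` flag.

```javascript
// Returns one item per word, with inclusive character offsets
// in the same shape as the `item` objects used in the question.
function tokenizeWithOffsets(text) {
  const items = [];
  let index = 0;
  // Assumed word definition: letters, combining marks and digits.
  for (const match of text.matchAll(/[\p{L}\p{M}\p{N}]+/gu)) {
    index += 1;
    items.push({
      index: index,
      word: match[0],
      characterOffsetBegin: match.index,
      characterOffsetEnd: match.index + match[0].length - 1,
    });
  }
  return items;
}
```

For the sample sentence this would yield eight items, with `però` at offsets 28 to 31 and the two occurrences of `nascoste` at 16 and 34. Repeated words get distinct offsets automatically because each match carries its own `index`.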
