JavaScript Regex to Match Boundaries of Words with Diacritics
I have to match word boundaries for words with diacritics in a text document. Given a word token, my regex looks like this:
var wordRegex = new RegExp("\\b(" + word + ")\\b", "g");
while ((match = wordRegex.exec(text)) !== null) {
  if (match.index > (seen.get(token) || -1)) {
    var wordStart = match.index;
    var wordEnd = wordStart + token.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;
    seen.set(token, wordEnd);
    break;
  }
}
This works for normal words like ciao, casa, etc. But it does not work when the text contains words like però or così.
const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male";
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i);
tokens.forEach((token, tokenIndex) => {
  var item = {
    "index": (tokenIndex + 1),
    "word": token
  };
  // Escape regex metacharacters so the token is matched literally
  var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
  var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
  var match = null;
  console.log(token, "---->", wordRegex);
  while ((match = wordRegex.exec(text)) !== null) {
    console.log("\t---->", match.index);
    if (match.index > (seen.get(token) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + token.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;
      seen.set(token, wordEnd);
      break;
    }
  }
});
You can see how some words, like macchine or nascoste, match, so I get a match.index. For other words, like però, the regex does not work and the match variable is null:
macchine ----> /\b(macchine)\b/g
    ----> 7
nascoste ----> /\b(nascoste)\b/g
    ----> 16
e, ----> /\b(e\,)\b/g
però, ----> /\b(però\,)\b/g
nascoste ----> /\b(nascoste)\b/g
    ----> 16
    ----> 34
So how do I write a word-boundary regex that supports diacritics?
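Some background on why the pattern fails: \b is defined in terms of \w, which in the regexes shown here covers only the ASCII characters [A-Za-z0-9_]. An accented letter such as ò therefore counts as a non-word character, and there is no boundary between ò and a following comma or space. One possible workaround, sketched below, replaces \b with lookarounds built on Unicode property escapes. It assumes an engine that supports lookbehind, \p{...} with the u flag, and String.prototype.matchAll (ES2018 to ES2020); the helper names escapeRegExp and findWord are hypothetical and not part of the original post.

```javascript
// Hypothetical helper: escape regex metacharacters so the token is matched literally.
// Only syntax characters are escaped, which keeps the result valid under the "u" flag.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the start offsets of every whole-word occurrence
// of `word` in `text`, treating any Unicode letter, mark, digit, or "_" as a
// word character.
function findWord(text, word) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const re = new RegExp(
    "(?<!" + wordChar + ")" + escapeRegExp(word) + "(?!" + wordChar + ")",
    "gu"
  );
  return Array.from(text.matchAll(re), (m) => m.index);
}

const text = "Ci son macchine nascoste e, però, nascoste male";
console.log(/\bperò\b/.test(text));      // false: no \b between "ò" and ","
console.log(findWord(text, "però"));     // [28]
console.log(findWord(text, "nascoste")); // [16, 34]
```

The lookarounds only assert what surrounds the word and consume nothing, so the reported index is the offset of the word itself in the original, unmodified text.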
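[UPDATE] Following the approach suggested in the comments, I applied diacritics removal to each word token before building the regex, and then to the whole text, like this: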
var normalizedText = removeDiacritics(text);
// for each token...
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
escaped = removeDiacritics(escaped);
var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
var match = null;
while ((match = wordRegex.exec( normalizedText )) !== null)
{
//...
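This time the accented words are matched by the \b word boundary, because by the time the regex runs both the token and the text contain only unaccented letters. Of course, this approach is not optimal, since removeDiacritics has to be applied to every token; the best solution would be to perform it only once.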
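One way to read "perform it only once" is to drop the per-token regex and collect the tokens together with their offsets in a single scan of the text. The following is a minimal sketch under that assumption, not the approach from the comments. It assumes an engine with Unicode property escapes and String.prototype.matchAll; tokenizeWithOffsets is a hypothetical name, and the field names mirror the ones used in the question's code (the end offset is inclusive).

```javascript
// Hypothetical helper: tokenize `text` and record each token's offsets in one pass.
function tokenizeWithOffsets(text) {
  const items = [];
  // A token is a run of Unicode letters, combining marks, or digits.
  const re = /[\p{L}\p{M}\p{N}]+/gu;
  let index = 0;
  for (const m of text.matchAll(re)) {
    index += 1;
    items.push({
      index: index,
      word: m[0],
      characterOffsetBegin: m.index,
      characterOffsetEnd: m.index + m[0].length - 1, // inclusive end offset
    });
  }
  return items;
}

const sample = "Ci son macchine nascoste e, però, nascoste male";
const items = tokenizeWithOffsets(sample);
console.log(items[5]); // word "però", characterOffsetBegin 28, characterOffsetEnd 31
```

Because every match carries its own index, repeated words such as nascoste get distinct offsets without a separate map of already-seen tokens.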
This is the solution we came up with in the comments to map words with diacritics to their indices in the text:
function removeDiacritics(text) {
  return _.deburr(text);
}
const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male";
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i);
// Remove diacritics from the whole text once, outside the loop
var normalizedText = removeDiacritics(text);
tokens.forEach((token, tokenIndex) => {
  var item = {
    "index": (tokenIndex + 1),
    "word": removeDiacritics(token)
  };
  // Escape regex metacharacters, then remove diacritics from the token
  var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
  escaped = removeDiacritics(escaped);
  var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
  var match = null;
  console.log(token, "---->", wordRegex);
  while ((match = wordRegex.exec(normalizedText)) !== null) {
    console.log("\t---->", match.index);
    if (match.index > (seen.get(token) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + token.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;
      seen.set(token, wordEnd);
      break;
    }
  }
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.11/lodash.min.js"></script>
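The offsets found this way refer to the normalized text, so they only carry over to the original text if removing diacritics does not change the string length. That holds for the Italian sample here, but _.deburr maps some characters to two letters (for example ß to ss and Æ to Ae), which would shift later offsets. As an illustration only, here is a lodash-free sketch based on String.prototype.normalize together with a cheap length check; stripDiacritics and sameLengthAfterStripping are hypothetical names, and the sketch assumes the input uses precomposed characters (NFC).

```javascript
// Hypothetical lodash-free alternative to removeDiacritics.
function stripDiacritics(text) {
  // NFD splits "ò" into "o" + U+0300; \p{M} then removes the combining mark.
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// Cheap sanity check: equal length is a necessary condition (not a proof)
// for offsets in the stripped text to line up with the original text.
function sameLengthAfterStripping(text) {
  return stripDiacritics(text).length === text.length;
}

console.log(stripDiacritics("però, così"));          // "pero, cosi"
console.log(sameLengthAfterStripping("però, così")); // true for precomposed input
```

If the check fails, for instance because the input already contains separate combining marks, the offsets should be computed against the original text instead.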