Counting words and characters in Java with high/low surrogates?

Question

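I know there are a few SO questions on this topic, but all of the proposed solutions seem to take a different approach from the example I have seen in JavaScript.

Here is a JavaScript example that counts the paragraphs, sentences, words and characters typed in a text string, including a check for high/low surrogates so that characters are counted correctly:

JavaScript version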

count(text);

function count(original) {
    var trimmed = original.replace(/[\u200B]+/, '').trim();
    return {
        paragraphs: trimmed ? (trimmed.match(/\n+/g) || []).length + 1 : 0,
        sentences: trimmed ? (trimmed.match(/[.?!…\n]+./g) || []).length + 1 : 0,
        words: trimmed ? (trimmed.replace(/['";:,.?¿\-!¡]+/g, '').match(/\S+/g) || []).length : 0,
        characters: trimmed ? _decode(trimmed.replace(/\s/g, '')).length : 0,
        all: _decode(original).length
    };
};

function _decode(string) {
    var output = [],
        counter = 0,
        length = string.length,
        value, extra;
    while (counter < length) {
        value = string.charCodeAt(counter++);
        if (value >= 0xD800 && value <= 0xDBFF && counter < length) {
            // High surrogate, and there is a next character.
            extra = string.charCodeAt(counter++);
            if ((extra & 0xFC00) === 0xDC00) {
                // Low surrogate.
                output.push(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000);
            } else {
                // unmatched surrogate; only append this code unit, in case the next
                // code unit is the high surrogate of a surrogate pair
                output.push(value);
                counter--;
            }
        } else {
            output.push(value);
        }
    }
    return output;
}

Demo below and on jsfiddle

var text = 'This is a paragraph. This is the 2nd sentence in the 1st paragraph.\nThis is another paragraph.';
var count = doCount(text);

document.body.innerHTML = '<pre>' + text + '</pre><hr>';
for (var i in count) {
    document.body.innerHTML += '<p>' + i + ': ' + count[i] + '</p>';
}

/* COUNTING LIBRARY */

/**
 * Extracted from https://github.com/RadLikeWhoa/Countable/, which in
 * turn uses `ucs2decode` function from the punycode.js library.
 */
function doCount(original) {
    var trimmed = original.replace(/[\u200B]+/, '').trim();
    return {
        paragraphs: trimmed ? (trimmed.match(/\n+/g) || []).length + 1 : 0,
        sentences: trimmed ? (trimmed.match(/[.?!…\n]+./g) || []).length + 1 : 0,
        words: trimmed ? (trimmed.replace(/['";:,.?¿\-!¡]+/g, '').match(/\S+/g) || []).length : 0,
        characters: trimmed ? _decode(trimmed.replace(/\s/g, '')).length : 0,
        all: _decode(original).length
    };
}

/**
 * `ucs2decode` function from the punycode.js library.
 *
 * Creates an array containing the decimal code points of each Unicode
 * character in the string. While JavaScript uses UCS-2 internally, this
 * function will convert a pair of surrogate halves (each of which UCS-2
 * exposes as separate characters) into a single code point, matching
 * UTF-16.
 *
 * @see <http://goo.gl/8M09r>
 * @see <http://goo.gl/u4UUC>
 *
 * @param {String} string The Unicode input string (UCS-2).
 *
 * @return {Array} The new array of code points.
 */
function _decode(string) {
    var output = [],
        counter = 0,
        length = string.length,
        value, extra;
    while (counter < length) {
        value = string.charCodeAt(counter++);
        if (value >= 0xD800 && value <= 0xDBFF && counter < length) {
            // High surrogate, and there is a next character.
            extra = string.charCodeAt(counter++);
            if ((extra & 0xFC00) === 0xDC00) {
                // Low surrogate.
                output.push(((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000);
            } else {
                // unmatched surrogate; only append this code unit, in case the next
                // code unit is the high surrogate of a surrogate pair
                output.push(value);
                counter--;
            }
        } else {
            output.push(value);
        }
    }
    return output;
}

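I am not familiar with character encoding schemes and things like high/low surrogates, but are they not needed when counting in Java?

I am happy with the results of the JavaScript implementation and would like to do the counting on my Java backend, but I am not sure whether the same approach is needed or how it should be done.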

Answer 1

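So what the JavaScript version does is read a surrogate pair as a single character whenever one appears in the text being decoded. This is needed in JavaScript because, depending on the engine, strings are exposed as UCS-2 or UTF-16 code units, and UTF-16 uses high and low surrogates, which means that a single visible character outside the Basic Multilingual Plane is encoded as two code units. To count the length correctly, the library takes the extra code unit into account and counts each pair as one.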
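The counting described above can be sketched in Java without a hand-written decoder, because `String.codePointCount` already counts a surrogate pair once. This is only a rough port under stated assumptions: the class `TextCounts` and its method names are made up for illustration, and the regular expressions are carried over from the JavaScript version as they stand. Note that Java's `trim()` and `\s` only cover ASCII whitespace by default, so results may differ from JavaScript on text containing other Unicode whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of any library.
class TextCounts {
    // Counts the non-overlapping matches of a regex in the input.
    static int matches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Mirrors the JavaScript pre-processing: drop the first run of
    // zero-width spaces (U+200B), then trim.
    static String trimmed(String original) {
        return original.replaceFirst("\\u200B+", "").trim();
    }

    static int paragraphs(String original) {
        String t = trimmed(original);
        return t.isEmpty() ? 0 : matches("\\n+", t) + 1;
    }

    static int sentences(String original) {
        String t = trimmed(original);
        return t.isEmpty() ? 0 : matches("[.?!\\u2026\\n]+.", t) + 1;
    }

    static int words(String original) {
        String t = trimmed(original);
        if (t.isEmpty()) {
            return 0;
        }
        // Remove punctuation first, then count runs of non-whitespace.
        String noPunctuation = t.replaceAll("['\";:,.?\\u00BF\\-!\\u00A1]+", "");
        return matches("\\S+", noPunctuation);
    }

    // Code points after removing whitespace; a surrogate pair counts once.
    static int characters(String original) {
        String t = trimmed(original).replaceAll("\\s", "");
        return t.codePointCount(0, t.length());
    }

    // Code points in the untouched input.
    static int all(String original) {
        return original.codePointCount(0, original.length());
    }

    public static void main(String[] args) {
        String text = "This is a paragraph. This is the 2nd sentence in the 1st paragraph.\n"
                + "This is another paragraph.";
        System.out.println("paragraphs: " + paragraphs(text));
        System.out.println("sentences: " + sentences(text));
        System.out.println("words: " + words(text));
        System.out.println("characters: " + characters(text));
        System.out.println("all: " + all(text));

        String emoji = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(emoji.length()); // 3 UTF-16 code units
        System.out.println(all(emoji));     // 2 code points
    }
}
```

The one-method-per-count layout is just for readability; returning all five numbers in a single object, as the JavaScript version does, would work equally well.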

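In Java you have a similar problem, except that Java lets you work with more encoding schemes. A Java String is also a sequence of UTF-16 code units, so `String.length()` counts a surrogate pair as two. Fortunately, Java already provides `String.codePointCount(0, s.length())`, which returns the correct length for a String containing surrogates, so no hand-written decoder is needed. Beyond that, if you want to separate combined code points or even remove them, Java provides `Normalizer` (for example, to remove diacritics from text).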

string = Normalizer.normalize(string, Normalizer.Form.NFD);
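To make the `Normalizer` call above concrete, here is a minimal sketch of removing diacritics: it decomposes the text to NFD and then deletes the combining marks. The class `DiacriticsDemo` and the method `stripDiacritics` are names invented for this example, and using `\p{M}` to match the combining marks is one common choice rather than the only one.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of any library.
class DiacriticsDemo {
    // Decomposes to NFD, then removes combining marks (Unicode category M).
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // 'é' as a single code point
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(precomposed.length()); // 1
        System.out.println(decomposed.length());  // 2: 'e' followed by U+0301
        System.out.println(stripDiacritics("caf\u00E9")); // cafe
    }
}
```

Whether a decomposed character should count as one character or two depends on what is being counted, so it may be worth normalizing to a single form (NFC or NFD) before counting.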

Solution 1
0 Accepted 2018-01-23 12:51:27
