
Inconsistencies in caret position, string length and matches index in C#

I am trying to get the currently selected word in a Scintilla textbox using Regex, and I am noticing inconsistencies between the reported string length, the match indexes, and the caret position (or start of the selection):

private KeyValuePair<int, string> get_current_word()
{
    int cur_pos = scin_txt.Selection.Start;
    KeyValuePair<int, string> kvp_word = new KeyValuePair<int, string>(0, "");
    MatchCollection words = Regex.Matches(scin_txt.Text, @"\b(?<word>\w+)\b");
    foreach (Match word in words)
    {
        int start = word.Index;
        int end = start + word.Length;
        if (start <= cur_pos && cur_pos <= end)
        {
            kvp_word = new KeyValuePair<int,string>(start, word.Value);
            break;
        }
    }
    return kvp_word;
}

In short, I am splitting the string into words and using the match indexes to check whether the caret currently falls within a word.

Unfortunately, the numbers don't seem to match properly:

scin_txt contains the string:

"Le clic droit a été désactivé pour cette image. J"

This string is 49 characters long, but the TextLength property returns 53 and the Selection.Start (or Caret.Position, same result) property returns 52. The caret is at the last position in the string and there are (to my knowledge) no spaces or invisible characters after the letter "J".

Meanwhile, the Regex match indexes and lengths seem correct.

Is this a bug or is there something I don't understand about how the lengths and selection indexes are computed? Is there a workaround to find the word containing the caret?

The Scintilla APIs are badly named. The Text property returns bytes, rather than text, and TextLength gives the number of bytes, not the number of characters.

Presumably, you are using UTF-8 mode, so the "text" is actually:

Le clic droit a \xc3\xa9t\xc3\xa9 d\xc3\xa9sactiv\xc3\xa9 pour cette image. J

which is exactly 53 bytes long.
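The arithmetic can be checked outside Scintilla. The following is a minimal sketch, assuming the control stores its text as UTF-8 (as presumed above); the class and method names are made up for illustration, and the accented letters are written as escapes so the count does not depend on how the source file is saved.

```csharp
using System.Text;

// Hypothetical helper, not part of any Scintilla binding.
public static class Utf8LengthDemo
{
    // The sample string from the question; \u00e9 is a precomposed "é".
    public const string Sample =
        "Le clic droit a \u00e9t\u00e9 d\u00e9sactiv\u00e9 pour cette image. J";

    // Number of UTF-16 code units, which is what string.Length and Regex indexes use.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    public static int ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

Each of the four "é" characters takes two bytes in UTF-8, so CharCount(Utf8LengthDemo.Sample) gives 49 while ByteCount gives 53, matching the TextLength value reported in the question.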

EDIT:

If you want to find the position of the start or end of a word, there are the SCI_WORDSTARTPOSITION / SCI_WORDENDPOSITION messages. For caret positioning, there are the SCI_POSITIONBEFORE / SCI_POSITIONAFTER messages, which take the current code page into account. (Presumably these messages all have functional equivalents in the API of the particular Scintilla binding you are using, or perhaps there is some generic SendMessage function for accessing them.)
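If you would rather keep the Regex approach from the question, another possible workaround is to convert the caret's byte offset into a character index before comparing it with the match indexes. This is only a sketch under the assumption that the caret position is a UTF-8 byte offset that lies on a character boundary; CaretWordSketch, ByteOffsetToCharIndex and WordAt are hypothetical names, not part of any Scintilla binding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any Scintilla binding.
public static class CaretWordSketch
{
    // Converts a UTF-8 byte offset into a UTF-16 character index.
    // Assumes byteOffset lies on a character boundary; an offset in the
    // middle of a multi-byte sequence would not map to a meaningful index.
    public static int ByteOffsetToCharIndex(string text, int byteOffset)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        int clamped = Math.Max(0, Math.Min(byteOffset, bytes.Length));
        return Encoding.UTF8.GetCharCount(bytes, 0, clamped);
    }

    // Returns the character index and value of the word containing the
    // caret, or (0, "") if there is none, mirroring the question's code.
    public static KeyValuePair<int, string> WordAt(string text, int caretByteOffset)
    {
        int caret = ByteOffsetToCharIndex(text, caretByteOffset);
        foreach (Match word in Regex.Matches(text, @"\b\w+\b"))
        {
            int start = word.Index;
            int end = start + word.Length;
            if (start <= caret && caret <= end)
            {
                return new KeyValuePair<int, string>(start, word.Value);
            }
        }
        return new KeyValuePair<int, string>(0, "");
    }
}
```

In the question's method this would be called as something like CaretWordSketch.WordAt(scin_txt.Text, scin_txt.Selection.Start). Note that the returned start index is in characters, so it would need converting back to bytes before being handed to Scintilla. Encoding the whole text on every call is wasteful for large documents; the word-position messages above avoid that.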
