How does BreakIterator work in Android?

I'm making my own text processor in Android (a custom vertical-script TextView for Mongolian). I thought I would have to find all the line-breaking locations myself in order to implement line wrapping, but then I discovered BreakIterator. It seems to find all the possible breaks between characters, words, lines, and sentences in various languages.

I'm trying to learn how to use it. The documentation was more helpful than average, but it was still difficult to understand from reading alone. I also found a few tutorials (see here, here, and here), but they lacked the full explanation with output that I was looking for.

I am adding this Q&A-style answer to help myself learn how to use BreakIterator.

I'm tagging this Android in addition to Java because there is apparently some difference between the two. Also, Android now supports the ICU BreakIterator, and future answers may deal with that.

BreakIterator can be used to find the possible breaks between characters, words, lines, and sentences. This is useful for things like moving the cursor through visible characters, double-clicking to select words, triple-clicking to select sentences, and line wrapping.

Boilerplate code

The following code is used in the examples below. Just adjust the first part to change the text and the type of BreakIterator.

// change these two lines for the following examples
String text = "This is some text.";
BreakIterator boundary = BreakIterator.getCharacterInstance();

// boilerplate code
boundary.setText(text);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
    System.out.println(start + " " + text.substring(start, end));
    start = end;
}

If you just want to test this out, you can paste it directly into an Activity's onCreate in Android. I'm using System.out.println rather than Log so that it is also testable in a Java-only environment.

I'm using java.text.BreakIterator rather than the ICU one, which is only available from API 24. See the links at the bottom for more information.
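For running the loop outside an Activity, the boilerplate can be wrapped in a small class. This is only a minimal sketch: the class name BreakDemo and the helper segments are illustrative names that do not appear in the original code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BreakDemo {
    // Hypothetical helper: returns one "startIndex segment" string per
    // boundary-delimited segment, mirroring the println in the boilerplate.
    static List<String> segments(BreakIterator boundary, String text) {
        List<String> out = new ArrayList<>();
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            out.add(start + " " + text.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is some text.";
        for (String line : segments(BreakIterator.getCharacterInstance(), text)) {
            System.out.println(line);
        }
    }
}
```

Swapping the iterator passed to segments is then the only change needed between the examples.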

Characters

Change the boilerplate code to include the following:

// the second é is written as e + U+0301 (combining acute accent), so it takes two UTF-16 values
String text = "Hi 中文ée\u0301\uD83D\uDE00\uD83C\uDDEE\uD83C\uDDF3.";
BreakIterator boundary = BreakIterator.getCharacterInstance();

Output

0 H
1 i
2  
3 中
4 文
5 é
6 é
8 😀
10 🇮🇳
14 .

The most interesting parts are at indexes 6, 8, and 10. Your browser may or may not display the characters correctly, but a user would interpret each of these as a single character even though they are made up of multiple UTF-16 values (two at index 6, two at index 8, and four at index 10).
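The gap between UTF-16 length and user-perceived characters can be shown by counting boundaries. The following is a rough sketch; GraphemeCount and count are made-up names, and how emoji sequences such as flags are grouped may differ between Android versions and desktop JDKs, so the example sticks to a combining accent.

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Counts the segments reported by a character-instance BreakIterator.
    static int count(String text) {
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int count = 0;
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "e\u0301"; // e followed by a combining acute accent
        System.out.println("UTF-16 length: " + text.length());
        System.out.println("Character breaks: " + count(text));
    }
}
```

This is the kind of count a cursor or a character limit would normally want, rather than String.length().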

Words

Change the boilerplate code to include the following:

String text = "I like to eat apples. 我喜欢吃苹果。";
BreakIterator boundary = BreakIterator.getWordInstance();

Output

0 I
1  
2 like
6  
7 to
9  
10 eat
13  
14 apples
20 .
21  
22 我
23 喜欢
25 吃
26 苹果
28 。

There are a few interesting things to note here. First, a word break is detected on both sides of a space, so spaces and punctuation come out as segments of their own. Second, even though the text mixes languages, multi-character Chinese words were still recognized. This was true in my tests even when I set the locale to Locale.US.
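Because spaces and punctuation are returned as segments too, real uses of the word iterator usually filter or look up segments. The sketch below shows two common patterns, under the assumption that a "word" is any segment starting with a letter or digit; WordPicker, wordsOnly, and segmentAt are illustrative names only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordPicker {
    // Keeps only the segments that start with a letter or digit,
    // dropping the space and punctuation segments.
    static List<String> wordsOnly(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                words.add(text.substring(start, end));
            }
            start = end;
        }
        return words;
    }

    // Returns the segment containing offset (0 <= offset < text.length()),
    // roughly what a double-click selection would pick.
    static String segmentAt(String text, int offset) {
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(text);
        int end = boundary.following(offset); // first boundary after offset
        int start = boundary.previous();      // boundary just before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "I like to eat apples.";
        System.out.println(wordsOnly(text));
        System.out.println(segmentAt(text, 3));
    }
}
```

Note that segmentAt returns a space or punctuation segment when the offset lands on one; a real selection handler would decide what to do in that case.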

Lines

Keep the text the same as in the Words example, but switch to a line instance:

String text = "I like to eat apples. 我喜欢吃苹果。";
BreakIterator boundary = BreakIterator.getLineInstance();

Output

0 I 
2 like 
7 to 
10 eat 
14 apples. 
22 我
23 喜
24 欢
25 吃
26 苹
27 果。

Note that the segments are not whole lines of text. The break locations are just the places where it is acceptable to wrap a line.

The output is similar to the Words example. However, white space and punctuation are now included with the word before them. This makes sense because you wouldn't want a new line to start with white space or punctuation. Also note that a line break is allowed after nearly every Chinese character (the final 。 stays attached to the character before it). This is consistent with the fact that it is OK to break multi-character words across lines in Chinese.
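To show how these break opportunities feed into wrapping, here is a simplified greedy wrapper. It is only a sketch under a strong assumption: width is measured in UTF-16 units, whereas a custom TextView would measure each candidate line in pixels. LineWrapper, wrap, and maxChars are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class LineWrapper {
    // Greedy wrap: extend the current line to the next break opportunity
    // while it still fits within maxChars; otherwise cut at the last one.
    // Trailing spaces count toward the width in this simplified version.
    static List<String> wrap(String text, int maxChars) {
        BreakIterator boundary = BreakIterator.getLineInstance();
        boundary.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = boundary.first();
        int lastBreak = lineStart;
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            if (end - lineStart > maxChars && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = end;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart)); // whatever is left
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("I like to eat apples.", 10)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

A single chunk longer than maxChars is left on its own overlong line here; breaking inside such a chunk would need the character iterator as a fallback.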

Sentences

Change the boilerplate code to include the following:

String text = "I like to eat apples. My email is me@example.com.\n" +
        "This is a new paragraph. 我喜欢吃苹果。我不爱吃臭豆腐。";
BreakIterator boundary = BreakIterator.getSentenceInstance();

Output

0 I like to eat apples. 
22 My email is me@example.com.
50 This is a new paragraph. 
75 我喜欢吃苹果。
82 我不爱吃臭豆腐。

Correct sentence breaks were recognized in multiple languages. Also, there was no false positive for the dot in the email domain. Note that each sentence keeps its trailing white space, including the newline after the email address.
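Since each segment carries its trailing white space, code that collects sentences usually trims them. A small sketch of that idea follows; SentenceSplitter and split are illustrative names, not part of the original answer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
    // Collects sentence segments, trimming the trailing white space
    // (spaces or newlines) that the iterator leaves attached.
    static List<String> split(String text) {
        BreakIterator boundary = BreakIterator.getSentenceInstance();
        boundary.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello there. How are you?")) {
            System.out.println(s);
        }
    }
}
```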

Notes

You can set the Locale when you create a BreakIterator, but if you don't, it just uses the default locale.
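Each factory method has an overload that takes a Locale. A brief sketch of passing one explicitly (LocaleDemo and countSegments are made-up names for illustration):

```java
import java.text.BreakIterator;
import java.util.Locale;

class LocaleDemo {
    // Counts word-iterator segments using an explicitly chosen locale
    // instead of the device or JVM default.
    static int countSegments(String text, Locale locale) {
        BreakIterator boundary = BreakIterator.getWordInstance(locale);
        boundary.setText(text);
        int count = 0;
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Hello", " ", and "world" are reported as separate segments.
        System.out.println(countSegments("Hello world", Locale.US));
    }
}
```

Passing the locale explicitly keeps results independent of the user's settings, which can matter for languages whose breaking rules differ.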

Further reading
