C# Equivalent for Java's BreakIterator

I'm working on a conversion project from Java to C#. Is there a C# equivalent of BreakIterator? I tried IEnumerator, but it has nothing that corresponds to the iterator.setText() call used below. Can anyone suggest equivalent C# code for the following lines?

import java.text.BreakIterator;
import java.util.ArrayList;

String finalResult = "";  // the text to split into sentences
ArrayList<String> resultList = new ArrayList<String>();
BreakIterator iterator = BreakIterator.getSentenceInstance(currentLocale);
//int counter = 0;
iterator.setText(finalResult);
int lastIndex = iterator.first();
while (lastIndex != BreakIterator.DONE)
{
    int firstIndex = lastIndex;
    lastIndex = iterator.next();
    if (lastIndex != BreakIterator.DONE)
    {
        // Each pair of consecutive boundaries delimits one sentence.
        String sentence = finalResult.substring(firstIndex, lastIndex);
        resultList.add(sentence);
        System.out.println("sentence = " + sentence);
        //counter++;
    }
}

BreakIterator is a mechanism for supporting locale-aware boundary analysis on arbitrary strings of Unicode text. I suspect the Java class is heavily based on (perhaps even directly dependent on, but I'm speculating) the ICU (International Components for Unicode) project: http://site.icu-project.org/

To quote the ICU docs:

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:

  1. Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
  2. Locating the beginning of a word that the user has selected.
  3. Counting characters, words, sentences, or paragraphs.
  4. Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).
  5. Making a list of the unique words in a document.
  6. Figuring out if a given range of text contains only whole words.
  7. Capitalizing the first letter of each word.
  8. Locating a particular unit of the text (For example, finding the third word in the document).
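To make a couple of the items in that list concrete (counting words, and listing the unique words in a text), here is a minimal sketch on the Java side using java.text.BreakIterator. The class name BoundaryDemo and the helper names words and uniqueWords are made up for illustration, and treating a segment as a "word" when it starts with a letter or digit is a simplifying assumption of this sketch, not something the ICU docs prescribe.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class; not part of the JDK or ICU.
class BoundaryDemo {

    // Returns the word-like segments of text, in order of appearance.
    static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // The word iterator also reports spaces and punctuation as segments;
            // keep only segments that begin with a letter or digit (assumption).
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    // Returns the distinct words, lower-cased, in first-seen order.
    static Set<String> uniqueWords(String text, Locale locale) {
        Set<String> unique = new LinkedHashSet<>();
        for (String word : words(text, locale)) {
            unique.add(word.toLowerCase(locale));
        }
        return unique;
    }

    public static void main(String[] args) {
        String text = "Hello, world! Hello again.";
        System.out.println("word count   = " + words(text, Locale.ENGLISH).size());
        System.out.println("unique words = " + uniqueWords(text, Locale.ENGLISH));
    }
}
```

The loop has the same shape as the sentence loop in the question: only the factory method changes (getWordInstance instead of getSentenceInstance), which is why a C# port needs a library that exposes the same kinds of boundary iterators.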

ICU provides a C/C++ implementation, aptly named ICU4C. The ICU FAQ describes ICU4C:

The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).

SIL International provides C# language bindings, which allow you to use ICU4C in C# applications, via a project named icu-dotnet.

You can find the official icu-dotnet repository on GitHub:
https://github.com/sillsdev/icu-dotnet

Or install it via NuGet:
https://www.nuget.org/packages/icu.net/
