
How to normalize/polish a text in Java?

What method would you suggest for normalizing a text in Java? For example:

String raw = "  This is\n  a test\n\r  ";
String txt = normalize(raw);
assert txt.equals("This is a test"); // compare contents with equals(); == compares references

I'm thinking about the StringUtils .replace() and .strip() methods, but maybe there is an easier way.

If it's only whitespace, try the following:

String txt = raw.replaceAll("\\s+", " ").trim();

I see that the string actually contains a newline that you want to get rid of. In that case I would recommend using a regex like so:

Pattern.compile("\\s+").matcher(text).replaceAll(" ").trim();

You can always store the compiled regex and reuse it for better performance.
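As a minimal sketch of what storing the compiled regex could look like: the class name WhitespaceNormalizer and the field name WHITESPACE are made up for illustration, and the method simply wraps the one-liner above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the thread.
final class WhitespaceNormalizer {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private WhitespaceNormalizer() {}

    static String normalize(String raw) {
        // Collapse every whitespace run to one space, then drop leading/trailing space.
        return WHITESPACE.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  This is\n  a test\n\r  ")); // prints "This is a test"
    }
}
```

Whether the saving is noticeable depends on how often the method is called; for a one-off call, raw.replaceAll("\\s+", " ").trim() is equivalent.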

It depends a little on exactly what you want to strip. If it's certain specific characters, then replaceAll(), as posted by @Yaneeve, would be the way to go. If the needs are more general, you might want to look at normalizing the string using the Normalizer.
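For reference, a small sketch of java.text.Normalizer, assuming that is the Normalizer meant here. Note that it performs Unicode normalization (for example, composing a letter and a combining accent into one character) and does not collapse whitespace, so it addresses a different kind of "normalizing" than the regex answers. The class and method names below are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; only Normalizer.normalize is a real JDK call.
final class UnicodeNormalizeDemo {
    private UnicodeNormalizeDemo() {}

    static String toNfc(String raw) {
        // NFC: canonical decomposition followed by canonical composition.
        return Normalizer.normalize(raw, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by a combining acute accent
        String composed = toNfc(decomposed);
        System.out.println(composed.equals("\u00e9")); // prints "true"
        System.out.println(composed.length());          // prints "1"
    }
}
```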

Apache Commons eventually added this functionality: org.apache.commons.lang3.StringUtils.normalizeSpace(String str)

To remove the leading and trailing spaces, you're looking for String#trim():

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#trim()

If normalization means collapsing sequences of spaces, tabs, and line breaks, then I'd consider using a simple regular expression with String.split() to create separate words, then appending them to a StringBuilder with the spacing you'd like in between. If performance really matters, another approach would be to simply loop over the String's characters, looking at each one and deciding whether to append it to the StringBuilder or discard it.
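The character loop could look roughly like the sketch below. The class name and the pendingSpace flag are made up for illustration, and using Character.isWhitespace as the test is an assumption: it does not match exactly the same set of characters as the regex \s.

```java
// Hypothetical sketch of the single-pass approach; names are illustrative.
final class CharLoopNormalizer {
    private CharLoopNormalizer() {}

    static String normalize(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        boolean pendingSpace = false; // true when whitespace was seen after a word
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c)) {
                // Remember the gap, but only if some word was already written.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) {
                    sb.append(' '); // emit one space per whitespace run
                    pendingSpace = false;
                }
                sb.append(c);
            }
        }
        // A trailing whitespace run leaves pendingSpace set but is never written.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  This is\n  a test\n\r  ")); // prints "This is a test"
    }
}
```

Deferring the space until the next non-whitespace character avoids having to trim the result afterwards.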

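import java.util.Scanner;

private static String normalize(String raw) {
    StringBuilder sb = new StringBuilder();
    // Scanner splits on whitespace by default, so next() yields one word at a time.
    Scanner scanner = new Scanner(raw);
    while (scanner.hasNext()) {
        if (sb.length() > 0) {
            sb.append(' '); // single space between words, none trailing
        }
        sb.append(scanner.next());
    }
    // Empty or whitespace-only input yields "" instead of throwing on deleteCharAt(-1).
    return sb.toString();
}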
