Java: How to normalize text?

Question

I want to build an index for my program, and one of the most important steps is normalizing text. For example, I need to convert "[(Mac Pro @apple)]" to "macproapple", filtering out blank spaces, punctuation ([()]), and special characters (@). My code looks like this:

StringBuilder sb = new StringBuilder(text);
sb = filterPunctuations(sb);
sb = filterSpecialChars(sb);
sb = filterBlankSpace(sb);
sb = toLower(sb);

Because this would generate a lot of String objects, I decided to use StringBuilder. But I don't know how to do it with StringBuffer. Does anyone have any suggestions? I also need to handle Chinese characters.

Answer 1

You can use the replaceAll API with a regular expression:

String originalText = "[(Mac Pro @apple)]";
String removedString = originalText.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase();

Internally, the replaceAll method accumulates its result in a single StringBuffer (a StringBuilder in newer JDKs), so you need not worry about many intermediate objects being created in memory.
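As a rough illustration of how this pattern treats the Chinese characters mentioned in the question, the sketch below precompiles the same regular expression and applies it to two inputs. The class name RegexNormalizer, the method normalize, and the second sample string are assumptions made for this example, not part of the original answer. The Chinese text is written with Unicode escapes (\u82F9\u679C is 苹果, "apple") so the file compiles regardless of source encoding.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class RegexNormalizer {
    // Same character class as above: anything that is not a Unicode letter or number.
    // Compiling it once avoids re-parsing the regex on every call.
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]");

    static String normalize(String text) {
        // Strip the unwanted characters, then lowercase independently of the default locale.
        return NON_ALNUM.matcher(text).replaceAll("").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints: macproapple
        System.out.println(normalize("[(Mac Pro @apple)]"));
        // CJK brackets and spaces are removed, the Chinese letters are kept.
        // The result equals "\u82F9\u679Cmacpro2012".
        System.out.println(normalize("\u3010\u82F9\u679C Mac Pro 2012\u3011"));
    }
}
```

Chinese characters belong to the Unicode letter category, so \p{L} keeps them, while full-width brackets count as punctuation and are dropped.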

Here is the code for replaceAll in the Matcher class:

public String replaceAll(String replacement) {
    reset();
    boolean result = find();
    if (result) {
        StringBuffer sb = new StringBuffer();
        do {
            appendReplacement(sb, replacement);
            result = find();
        } while (result);
        appendTail(sb);
        return sb.toString();
    }
    return text.toString();
}
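If you would rather avoid regular expressions altogether, a single pass over the code points with one StringBuilder is another option. This is only a minimal sketch, not the code from the question: the class name CodePointNormalizer and the method normalize are hypothetical, and Character.isLetterOrDigit is assumed to be an acceptable stand-in for the "letters and numbers" rule (it accepts letters and decimal digits, which is slightly narrower than \p{L}\p{N}).

```java
class CodePointNormalizer {
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read a full code point so characters outside the BMP are not split.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Keep letters (including Chinese characters) and decimal digits only.
            if (Character.isLetterOrDigit(cp)) {
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: macproapple
        System.out.println(normalize("[(Mac Pro @apple)]"));
    }
}
```

Filtering and lowercasing happen in the same loop, so only one StringBuilder and one final String are created. Per-code-point lowercasing can differ from String.toLowerCase for a few special characters, so treat this as a starting point.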

Answer 2

Try this:

class Solution
{
        public static void main (String[] args)
        {
                String s = "[(Mac Pro @apple)]";
                s = s.replaceAll("[^A-Za-z]", "");
                System.out.println(s);
        }
}

This gives the output:

MacProapple

A brief explanation of the lines above:

s.replaceAll("[^A-Za-z]", "") removes every character in the string that is not (the ^ denotes negation) in the ranges A-Z and a-z. Regex in Java is explained here.

If you want to convert the string to lowercase at the end, you need to use s.toLowerCase().
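Putting the two steps together might look like the sketch below. The class name AsciiNormalizer and the method normalize are assumptions for illustration. Note that because the character class keeps only A-Z and a-z, digits and Chinese characters are removed as well, which may not suit the index described in the question.

```java
import java.util.Locale;

class AsciiNormalizer {
    static String normalize(String s) {
        // Drop everything except ASCII letters, then lowercase the remainder.
        return s.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints: macproapple
        System.out.println(normalize("[(Mac Pro @apple)]"));
        // Digits are not in A-Za-z, so this prints: macpro
        System.out.println(normalize("Mac Pro 2012"));
    }
}
```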

Answer 1: score 2, accepted, 2012-04-24 06:08:41
Answer 2: score 1, 2012-04-24 05:59:46
