Java: how to normalize text?
I want to build an index for my program, and one of the most important steps is normalizing text. For example, I need to convert "[(Mac Pro @apple)]" to "macproapple", which means filtering out blank spaces, punctuation ([()]), and special characters (@). My code looks like this:
StringBuilder sb = new StringBuilder(text);
sb = filterPunctuations(sb);
sb = filterSpecialChars(sb);
sb = filterBlankSpace(sb);
sb = toLower(sb);
Because this would generate a lot of String objects, I decided to use StringBuilder. But I don't know how to do it with StringBuffer. Does anyone have suggestions? I also need to handle Chinese characters.
You can use the replaceAll API with a regular expression:
String originalText = "[(Mac Pro @apple)]";
String removedString = originalText.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase();
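Since the question also mentions Chinese characters, here is a minimal sketch of the same regex wrapped in a reusable helper. The class name TextNormalizer and the second sample string are illustrative assumptions, not part of the original answer. \p{L} covers letters in any script (including Han characters) and \p{N} covers numeric characters, so Chinese text is kept while punctuation and spaces are dropped.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for illustration.
class TextNormalizer {
    // Compiled once and reused: matches any character that is
    // neither a Unicode letter (\p{L}) nor a Unicode number (\p{N}).
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]");

    static String normalize(String text) {
        // Locale.ROOT keeps lowercasing independent of the default locale.
        return NON_ALNUM.matcher(text).replaceAll("").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("[(Mac Pro @apple)]")); // macproapple
        System.out.println(normalize("苹果 Mac-Pro 2013!"));  // 苹果macpro2013
    }
}
```

Precompiling the Pattern avoids recompiling the regex on every call, which may matter when indexing many strings.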
Internally, the replaceAll method uses a StringBuffer, so you need not worry about multiple objects being created in memory.
Here is the code for replaceAll in the Matcher class:
public String replaceAll(String replacement) {
    reset();
    boolean result = find();
    if (result) {
        StringBuffer sb = new StringBuffer();
        do {
            appendReplacement(sb, replacement);
            result = find();
        } while (result);
        appendTail(sb);
        return sb.toString();
    }
    return text.toString();
}
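If you would rather use a StringBuilder directly, as the question suggests, the separate filter steps can be collapsed into a single pass. The following is only a sketch under stated assumptions: the class name SinglePassNormalizer is hypothetical, and Character.isLetterOrDigit is used as a stand-in for the regex classes. It is not exactly equivalent to [\p{L}\p{N}], because it accepts only decimal digits rather than every numeric character, and per-character lowercasing can differ from String.toLowerCase for a few special cases.

```java
// Hypothetical single-pass alternative; not taken from the original answers.
class SinglePassNormalizer {
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        // Walk the string by code point so that characters outside the
        // Basic Multilingual Plane (surrogate pairs) are handled whole.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                // Keep letters and digits, lowercased; drop everything else.
                sb.appendCodePoint(Character.toLowerCase(cp));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("[(Mac Pro @apple)]")); // macproapple
    }
}
```

This builds one StringBuilder and one final String per input, with no intermediate strings between filtering steps.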
Try this:
class Solution
{
    public static void main (String[] args)
    {
        String s = "[(Mac Pro @apple)]";
        s = s.replaceAll("[^A-Za-z]", "");
        System.out.println(s);
    }
}
This gives the output:
MacProapple
A short explanation of the lines above:
s.replaceAll("[^A-Za-z]", "") removes every character in the string that is not (denoted by ^) in A-Z or a-z. Regex in Java is explained here.
If you want to convert the string to lowercase at the end, use s.toLowerCase().
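Note that a character class limited to A-Z and a-z also strips digits and Chinese characters, which the question says it needs to keep. The sketch below contrasts the two patterns from this thread; the class name, method names, and sample input are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical comparison of the two regexes suggested in the answers.
class AsciiVsUnicode {
    // Keeps only ASCII letters, then lowercases.
    static String asciiOnly(String s) {
        return s.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
    }

    // Keeps letters and numbers from any script, then lowercases.
    static String unicodeAware(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String input = "苹果 iPhone 6s";
        System.out.println(asciiOnly(input));    // iphones
        System.out.println(unicodeAware(input)); // 苹果iphone6s
    }
}
```

Which pattern fits depends on whether the index should contain digits and non-Latin text.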