
Java StringTokenizer odd behavior

I'm trying to extract only the lowercase alphanumeric characters from a document with this:

String delim = "abcdefghijklmnopqrstuvwxyz0123456789";

StringTokenizer strtok = new StringTokenizer(str, delim, true);

String newstr = "";

while (strtok.hasMoreTokens()) {
    newstr = newstr + strtok.nextToken();
}

return newstr;

Note that the document is already lowercase only. But for some reason all of the punctuation characters are still being returned, along with parentheses, slashes, etc.

I thought passing true as the boolean when creating the tokenizer would count the delimiters as tokens?

The delim argument is the set of delimiter characters. You're basically asking for each token to be "whatever is between lowercase letters and digits". Then the true argument says "give me those delimiter characters as tokens too", so concatenating every token just rebuilds the original string, punctuation included. Were you looking for replaceAll("[^abcdefghijklmnopqrstuvwxyz0123456789]","") ?
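To make the difference concrete, here is a minimal sketch that puts the question's loop next to the replaceAll suggestion. The class name LowercaseAlnumFilter, the method names, and the sample string are made up for illustration, and the character class [^a-z0-9] is used as a shorter equivalent of the explicit list above. The third method is one possible way to stay with StringTokenizer; it is not something the answer proposed.

```java
import java.util.StringTokenizer;

public class LowercaseAlnumFilter {
    // Same delimiter set as in the question.
    private static final String ALNUM = "abcdefghijklmnopqrstuvwxyz0123456789";

    // The question's approach: with returnDelims = true the tokenizer hands back
    // both the delimiters and the text between them, so nothing is filtered out.
    static String tokenizeAll(String str) {
        StringTokenizer strtok = new StringTokenizer(str, ALNUM, true);
        StringBuilder sb = new StringBuilder();
        while (strtok.hasMoreTokens()) {
            sb.append(strtok.nextToken());
        }
        return sb.toString();
    }

    // The answer's suggestion: delete every character that is not a
    // lowercase letter or a digit.
    static String keepAlnum(String str) {
        return str.replaceAll("[^a-z0-9]", "");
    }

    // Hypothetical tokenizer-based variant: each delimiter comes back as a
    // one-character token, so keep only those and skip everything else.
    static String keepDelimiterTokens(String str) {
        StringTokenizer strtok = new StringTokenizer(str, ALNUM, true);
        StringBuilder sb = new StringBuilder();
        while (strtok.hasMoreTokens()) {
            String tok = strtok.nextToken();
            if (tok.length() == 1 && ALNUM.indexOf(tok.charAt(0)) >= 0) {
                sb.append(tok);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "foo(bar)/baz-42!";
        System.out.println(tokenizeAll(sample));         // input comes back unchanged
        System.out.println(keepAlnum(sample));           // punctuation removed
        System.out.println(keepDelimiterTokens(sample)); // punctuation removed
    }
}
```

If the tokenizer is not needed for anything else, the replaceAll version is the shorter and clearer of the two working options.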
