
Remove all non-alphanumeric characters but allow multi-word terms

For a MapReduce job, I'm trying to strip all non-alphanumeric characters from each token, stem it, and lowercase it unless it is an acronym. I also want to keep hyphenated multi-word terms such as "life-changing" intact. This is what I have so far; how should I change it?

    String line = value.toString();

    StringTokenizer itr = new StringTokenizer(line);

    while (itr.hasMoreTokens())
    {
        // Fold case and stem first, then trim non-alphanumerics from both ends only
        String token = stem(caseFold(itr.nextToken()));
        token = token.replaceAll("^[^a-zA-Z0-9]*|[^a-zA-Z0-9]*$", "");

        ....
    }

You can use a publicly available dictionary API, such as dictionaryapi.com.
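If a dictionary lookup is more than the job needs, a regex-only approach may be enough. The following is a minimal sketch, not a definitive implementation: it removes every character except ASCII letters, digits, and hyphens, collapses and trims stray hyphens, and lowercases the result unless it looks like an acronym. The class `TokenCleaner`, its methods, and the acronym rule (two or more characters, all upper-case letters or digits, with at least one letter) are assumptions made for illustration. The question's `stem` and `caseFold` helpers are not shown in the post, so they are left out; stemming would presumably be applied to the value returned by `normalize`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper class; the names are illustrative, not from the question.
final class TokenCleaner {

    private TokenCleaner() {}

    // Assumed acronym rule: at least two characters, only upper-case
    // ASCII letters or digits, and at least one letter.
    static boolean isAcronym(String token) {
        return token.matches("[A-Z0-9]{2,}") && token.matches(".*[A-Z].*");
    }

    // Remove everything except ASCII letters, digits and hyphens,
    // then tidy up hyphens that no longer join two words.
    static String clean(String raw) {
        String token = raw.replaceAll("[^a-zA-Z0-9-]", "");
        token = token.replaceAll("-{2,}", "-");   // "well--known" -> "well-known"
        return token.replaceAll("^-+|-+$", "");   // drop leading/trailing hyphens
    }

    // Clean the token and lowercase it unless it looks like an acronym.
    static String normalize(String raw) {
        String token = clean(raw);
        return isAcronym(token) ? token : token.toLowerCase(Locale.ROOT);
    }

    // Split a line on whitespace and keep only the non-empty normalized tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = normalize(itr.nextToken());
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Inside the mapper loop, `TokenCleaner.normalize(itr.nextToken())` would take the place of the `caseFold` and `replaceAll` calls. Note that this variant also drops punctuation inside a token (so "don't" becomes "dont"), whereas the regex in the question only trims the two ends of each token.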

