Remove all non-alphanumeric characters but allow multi-word terms
For a MapReduce job I'm trying to remove all non-alphanumeric characters, stem each token, and lowercase it unless it is an acronym, but I want to keep hyphenated multi-word terms like "life-changing" intact. This is what I have so far; how should I change it?
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens())
{
String token = stem(caseFold(itr.nextToken()));
token=token.replaceAll("^[^a-zA-Z0-9]*|[^a-zA-Z0-9]*$", "");
....
}
You could use a publicly available dictionary API, for example dictionaryapi.com.
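For the regex part of the question, here is a minimal sketch of one possible approach, not a definitive answer. It cleans the token first, so that the stemmer never sees punctuation, and only then decides whether to lowercase. The names `TokenCleaner`, `clean`, `isAcronym`, `normalize` and `normalizeLine` are made up for illustration. The acronym rule (two or more capital letters and nothing else) is an assumption, and `stem` here is a do-nothing placeholder standing in for the asker's own `stem` function, which is not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

final class TokenCleaner {

    // Keep letters, digits and single internal hyphens; drop everything else.
    static String clean(String token) {
        String s = token.replaceAll("[^a-zA-Z0-9-]", ""); // drop punctuation except '-'
        s = s.replaceAll("-{2,}", "-");                    // collapse runs of hyphens
        return s.replaceAll("^-|-$", "");                  // trim a leading/trailing hyphen
    }

    // Assumed heuristic: two or more capital letters and nothing else.
    static boolean isAcronym(String token) {
        return token.matches("[A-Z]{2,}");
    }

    // Hypothetical placeholder for the asker's stem(); returns its input unchanged.
    static String stem(String token) {
        return token;
    }

    // Clean first, then lowercase and stem unless the token looks like an acronym.
    static String normalize(String rawToken) {
        String cleaned = clean(rawToken);
        if (cleaned.isEmpty() || isAcronym(cleaned)) {
            return cleaned;
        }
        return stem(cleaned.toLowerCase(Locale.ROOT));
    }

    // Tokenize a line on whitespace and keep only non-empty normalized tokens.
    static List<String> normalizeLine(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String token = normalize(itr.nextToken());
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Inside the mapper, the loop body would then reduce to something like `String token = TokenCleaner.normalize(itr.nextToken());`. Note that this sketch also removes apostrophes and dots, so "don't" becomes "dont" and "U.S.A." becomes "USA"; whether that is acceptable depends on the data.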