Scanning only words from a text file with Scanner

Question

I have to read in words from a file. For example, a sentence might be:

Bill's favorite animal is a dog. He is buying one at 1:30.

I need to keep only the words, preserving apostrophes but eliminating the 1:30. The desired output for this would start:

Bill's
favorite
animal
...etc.

Code:

Scanner scanner = null;
Pattern pattern=Pattern.compile("[^\\w+]");
String word;

try{
    scanner=new Scanner(file);
}catch(FileNotFoundException e){
    System.out.println("Can't Find the File in Dictionary class!");
}
time=System.nanoTime();
while(scanner.hasNext()){
    scanner.useDelimiter(pattern);
    word=scanner.next();
    System.out.println(word);
    if(!word.equals("")){
        dictionary.add(word);
    }
}

I have tried using a delimiter, but that results in Bill and s on separate lines with no '. I was hoping simply to be able to use

scanner.next(Pattern.compile("[^\\w+]"));

but when I try that I get an InputMismatchException. Hopefully someone can help with this! Thanks!

Answer 1

The pattern "[^\\w+]" is wrong. It matches any single character that is not a letter, digit, underscore, or plus sign. The plus sign here sits inside the character class, so it is not a quantifier. If your sample text contained "Bill got an A+", it would find the words "Bill", "got", "an", and "A+". Is this what you want? It seems more likely you meant to write "[^\\w]+", which would also eliminate the empty strings from the results when there is a run of delimiter characters.
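To make the difference concrete, here is a minimal sketch that runs both delimiters over the same text. The class name DelimiterQuantifierDemo, the helper tokens, and the sample string are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterQuantifierDemo {
    // Collects every token a Scanner yields for the given delimiter regex.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Bill got an A+, really";
        // '+' inside the class: plus signs stay in tokens, and the
        // adjacent ',' and ' ' produce an empty token between them.
        System.out.println(tokens(text, "[^\\w+]"));
        // '+' as a quantifier: each run of non-word characters is one delimiter.
        System.out.println(tokens(text, "[^\\w]+"));
    }
}
```

With the first pattern the list keeps "A+" and contains an empty string, which is why the original loop needs its equals("") check. With the second pattern neither happens.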

It seems like you can just add the apostrophe to the pattern. If we also move the stray plus sign, that results in a pattern of "[^\\w']+". While this is closer, \w still includes digits, so you will get "1" and "30" as words from "1:30".
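A quick way to check this is to run that delimiter over the sample sentence from the question. This is only a sketch; the class name ApostropheDelimiterDemo and the helper tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ApostropheDelimiterDemo {
    // Splits the text on the given delimiter regex using a Scanner.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "Bill's favorite animal is a dog. He is buying one at 1:30.";
        // "Bill's" now survives intact, but the digits are still word characters,
        // so "1" and "30" show up as the last two tokens.
        tokens(sentence, "[^\\w']+").forEach(System.out::println);
    }
}
```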

I think what you really want is "[^\\p{Alpha}']+", which uses runs of one or more characters that are not letters or apostrophes as delimiters, and thus returns every run of letters and apostrophes as a token. For your sample sentence, the output would be the following tokens:

Bill's
favorite
animal
is
a
dog
He
is
buying
one
at
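Putting it together, the reading loop could look roughly like the sketch below. It assumes a plain List in place of the dictionary object from the question, and the names WordReader, scanWords, and readWords are invented for the example. The delimiter is set once before the loop, and try-with-resources closes the Scanner, so a missing file surfaces as an exception instead of a null scanner.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class WordReader {
    // Delimiter: one or more characters that are neither ASCII letters nor apostrophes.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{Alpha}']+");

    // Collects all words from an already opened Scanner.
    static List<String> scanWords(Scanner scanner) {
        List<String> words = new ArrayList<>();
        scanner.useDelimiter(NON_WORD); // set once, not on every iteration
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        return words;
    }

    // Opens the file, reads its words, and closes the Scanner afterwards.
    static List<String> readWords(Path file) throws IOException {
        try (Scanner scanner = new Scanner(file)) {
            return scanWords(scanner);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the real input file.
        Path file = Files.createTempFile("words", ".txt");
        Files.writeString(file, "Bill's favorite animal is a dog. He is buying one at 1:30.");
        readWords(file).forEach(System.out::println);
        Files.delete(file);
    }
}
```

Note that \p{Alpha} only covers ASCII letters by default, and that a word wrapped in single quotes would keep its quotes, since the apostrophe is treated as part of a word.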
