
Scanning in only words with Scanner and a text file

I have to read in words from a file. For example, a sentence might be:

Bill's favorite animal is a dog. He is buying one at 1:30.

I need to keep only the words: apostrophes should be preserved, but tokens such as 1:30 should be dropped. The desired output for this would start:

  • Bill's
  • favorite
  • animal
  • ...etc.

Code:

Scanner scanner = null;
Pattern pattern=Pattern.compile("[^\\w+]");
String word;

try{
    scanner=new Scanner(file);
}catch(FileNotFoundException e){
    System.out.println("Can't Find the File in Dictionary class!");
}
time=System.nanoTime();
while(scanner.hasNext()){
    scanner.useDelimiter(pattern);
    word=scanner.next();
    System.out.println(word);
    if(!word.equals("")){
        dictionary.add(word);
    }
}

I have tried using a delimiter, but that results in Bill and s on separate lines, with no ' between them. I was hoping simply to be able to use

scanner.next(Pattern.compile("[^\\w+]"));

but when I try that I get an InputMismatchException. Hopefully someone can help with this! Thanks!

The pattern "[^\\w+]" is wrong. It matches any single character that is not a letter, digit, underscore, or plus sign. The plus sign here is not a quantifier, because inside a character class it is a literal character, so if your sample text contained "Bill got an A+" it would find the words "Bill", "got", "an", and "A+". Is this what you want? It seems more likely you meant to write "[^\\w]+", which would eliminate the empty strings from the results when there is a run of delimiter characters.
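
To make the difference concrete, here is a minimal sketch that runs both delimiters over the same text. The class name DelimiterDemo, the helper tokens, and the sample string are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Hypothetical helper: collects every token the Scanner produces
    // for the given delimiter regex.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Bill got an A+, really.";

        // '+' inside the class is a literal, and each delimiter is one character,
        // so the adjacent "," and " " produce an empty token between them.
        System.out.println(tokens(text, "[^\\w+]"));  // [Bill, got, an, A+, , really]

        // '+' outside the class is a quantifier: a run of delimiters counts once.
        System.out.println(tokens(text, "[^\\w]+"));  // [Bill, got, an, A, really]
    }
}
```

The empty token in the first result is why the original loop needed the word.equals("") check.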

It seems like you can just add the apostrophe to the pattern. If we also move the stray plus sign, that gives the pattern "[^\\w']+". This is closer, but \w still includes digits, so you will get "1" and "30" as words from "1:30".
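
As a quick check of that intermediate pattern, the following sketch applies it to the sentence from the question. The class name ApostropheDemo and the method split are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ApostropheDemo {
    // Hypothetical helper: tokenizes text, treating runs of characters that are
    // neither word characters (\w) nor apostrophes as delimiters.
    static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("[^\\w']+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "Bill's favorite animal is a dog. He is buying one at 1:30.";
        // "Bill's" now stays in one piece, but the digits of "1:30" still come
        // through as the separate tokens "1" and "30".
        System.out.println(split(sentence));
    }
}
```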

I think what you really want is "[^\\p{Alpha}']+", which uses runs of one or more characters that are not letters or apostrophes as delimiters, and thus matches all runs of letters and apostrophes as tokens. The output would be the following tokens:

  • Bill's
  • favorite
  • animal
  • is
  • a
  • dog
  • He
  • is
  • buying
  • one
  • at
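
Putting it together, the reading loop from the question could look roughly like the sketch below. It assumes the file path is passed as the first command-line argument and that the file is UTF-8; the class name WordReader and the method readWords are illustrative, and a plain list stands in for the dictionary object from the question. The delimiter is set once, before the loop, rather than on every iteration.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class WordReader {
    // Runs of anything that is not an ASCII letter or an apostrophe act as delimiters.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{Alpha}']+");

    // Hypothetical helper: returns the words the Scanner yields with that delimiter.
    static List<String> readWords(Scanner scanner) {
        scanner.useDelimiter(NON_WORD); // configure once, before reading
        List<String> words = new ArrayList<>();
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Assumption: args[0] is the path of a UTF-8 text file.
            try (Scanner scanner = new Scanner(Paths.get(args[0]), "UTF-8")) {
                System.out.println(readWords(scanner));
            }
        } else {
            // No file given: fall back to the sentence from the question.
            String sample = "Bill's favorite animal is a dog. He is buying one at 1:30.";
            try (Scanner scanner = new Scanner(sample)) {
                System.out.println(readWords(scanner));
            }
        }
    }
}
```

Two caveats: in Java, \p{Alpha} covers only ASCII letters by default, so text with accented or non-Latin letters would probably need \p{L} instead; and because the apostrophe is allowed anywhere in a token, a word wrapped in single quotes keeps its quote marks.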
