
Storing words from a .txt file into a String array

I was going through the answers to this question, asked by someone previously, and found them very helpful. However, I have a question about the highlighted answer, and I wasn't sure whether I should ask there since it's a 6-year-old thread.

My question is about this snippet of code given in the answers:

private static boolean isAWord(String token)
{
    //check if the token is a word
}

How would you check that the token is a word? Would you call .contains("\\s+") on the string and check whether there are characters between the whitespace? But what about when you encounter a paragraph? I'm not sure how to go about this.

EDIT: I think I should've elaborated a bit more. Usually, you'd think a word would be something surrounded by " " but, for example, if the file contains a hyphen (which is also surrounded by blank spaces), you'd want the isAWord() method to return false. How can I verify that something is actually a word and not punctuation?

Since the question wasn't entirely clear, I made two methods. The first method, consistsOfLetters, goes through the whole string and returns false if it contains any digits or symbols. This should be enough to determine whether a token is a word (if you don't mind whether that word exists in a dictionary or not).

public static boolean consistsOfLetters(String string) {
    // Lowercase once so we only have to compare against 'a'..'z'.
    String lower = string.toLowerCase();
    for (int i = 0; i < lower.length(); i++) {
        char c = lower.charAt(i);
        // A trailing '.' is tolerated, so "word." still counts as a word (a lone "." does not).
        if (c == '.' && i + 1 == lower.length() && lower.length() != 1) break;
        if (c < 'a' || c > 'z') return false;
    }
    return true;
}

The second method splits the original string (for example, a sentence of potential words) on the " " character. It then goes through every element and checks whether it is a word. If an element is not a word, the method returns false and skips the rest. If every element passes, it returns true.

    public static boolean isThisAWord(String string) {
        // Split on single spaces and require every piece to pass the letter check.
        for (String token : string.split(" ")) {
            if (!consistsOfLetters(token)) return false;
        }
        return true;
    }

Also, this might not work for English as is, since English has apostrophes in words like "don't", so a bit of further tinkering is needed.
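One way that tinkering could go is sketched below: a character loop that accepts letters and also an apostrophe, as long as the apostrophe sits between two letters. The class name ApostropheWords and the method isWordWithApostrophe are made up for this illustration, and the sketch uses Character.isLetter instead of the 'a'..'z' range check, so it is a variation on the idea and not the code from the answer. It also does not keep the trailing-period rule.

```java
final class ApostropheWords {

    // Hypothetical helper: letters only, plus apostrophes that are
    // strictly inside the token with a letter on both sides.
    static boolean isWordWithApostrophe(String token) {
        if (token.isEmpty()) {
            return false; // assumption: an empty token is not a word
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean inner = i > 0 && i < token.length() - 1;
            if (c == '\'' && inner
                    && Character.isLetter(token.charAt(i - 1))
                    && Character.isLetter(token.charAt(i + 1))) {
                continue; // apostrophe between two letters is fine
            }
            if (!Character.isLetter(c)) {
                return false; // digits, hyphens, stray quotes, ...
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWordWithApostrophe("don't")); // true
        System.out.println(isWordWithApostrophe("-"));     // false
        System.out.println(isWordWithApostrophe("abc1"));  // false
    }
}
```

A leading or trailing apostrophe (as in "'tis") is rejected by this sketch; whether that is the right call depends on the text being processed.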

By default, the Scanner in Java splits input using its WHITESPACE_PATTERN, so splitting a string like "He's my friend" results in the tokens ["He's", "my", "friend"]. If that is sufficient, just remove that if clause and don't use that method. If you want to get "He", "is" instead of "He's", you need a different approach.
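To see the default whitespace splitting in action, here is a small sketch that feeds a string to a Scanner and collects whatever next() returns. The class name ScannerTokens and the helper tokens are placeholders chosen for this example; a real program would pass a file to the Scanner instead of a literal string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerTokens {

    // Hypothetical helper: collect every whitespace-delimited token.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                result.add(scanner.next()); // default delimiter is whitespace
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("He's my friend")); // [He's, my, friend]
    }
}
```

Note that a hyphen surrounded by spaces comes back as its own token, which is exactly the case the isAWord check is meant to filter out.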

In short: the method works as a verification check. If the given token is not supposed to be in the result, return false; otherwise return true.

return token.matches("[\\pL\\pM]+('(s|t))?"); // optional 's or 't suffix, e.g. let's, doesn't

matches requires the entire string to match.

This accepts letters (\pL) and zero-length combining diacritical marks (\pM, accents), optionally followed by an English apostrophe suffix, should you want to treat doesn't and let's as one term (for instance, for translation purposes). You might also consider hyphens.

Note that Unicode has several different single-quote and dash characters.
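As a rough illustration of those points, the sketch below wraps a variant of the pattern in an isAWord method. The variant is an assumption on my part, not the pattern from the answer: it allows hyphenated compounds, and it accepts either the ASCII apostrophe or the typographic one (U+2019) before an s or t suffix. The class name RegexWordCheck is made up.

```java
import java.util.regex.Pattern;

final class RegexWordCheck {

    // Letters and combining marks, optional hyphenated parts,
    // optional 's / 't suffix with ' or U+2019 as the apostrophe.
    private static final Pattern WORD = Pattern.compile(
            "[\\pL\\pM]+(-[\\pL\\pM]+)*(['\\u2019](s|t))?");

    static boolean isAWord(String token) {
        // matches() only succeeds if the whole token fits the pattern.
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAWord("doesn't"));    // true
        System.out.println(isAWord("well-known")); // true
        System.out.println(isAWord("-"));          // false
    }
}
```

Other dash and quote characters could be added to the character classes in the same way if the input text uses them.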

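import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Path path = Paths.get("..../x.txt");
Charset charset = Charset.defaultCharset();
String content = Files.readString(path, charset); // Files.readString needs Java 11+
Pattern wordPattern = Pattern.compile("[\\pL\\pM]+");
Matcher m = wordPattern.matcher(content);
while (m.find()) { // each match is one run of letters
    String word = m.group(); // ...
}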
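To get from that loop to the String array the question title asks about, the matches can be collected into a list and converted at the end. This is a minimal sketch under a few assumptions: the class name WordArray and the method extractWords are invented for the example, and the text is passed in as a string so the block runs on its own (in practice it would come from Files.readString as above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordArray {

    private static final Pattern WORD = Pattern.compile("[\\pL\\pM]+");

    // Hypothetical helper: every run of letters becomes one array element.
    static String[] extractWords(String content) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(content);
        while (m.find()) {
            words.add(m.group());
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = extractWords("He's my friend - really.");
        System.out.println(Arrays.toString(words)); // [He, s, my, friend, really]
    }
}
```

Because the pattern only covers letters and marks, "He's" is split into "He" and "s" and the lone hyphen is dropped; a pattern that includes the apostrophe would keep "He's" together.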
