
Storing words from a .txt file into a String array

I was going through the answers to a question someone asked previously and found them very helpful. However, I have a question about the highlighted answer, and I wasn't sure whether I should ask there since it's a six-year-old thread.

My question is about this snippet of code given in the answers:

private static boolean isAWord(String token)
{
    //check if the token is a word
}

How would you check that the token is a word? Would you .contains("\\s+") the string and check to see if it contains characters between them? But what about when you encounter a paragraph? I'm not sure how to go about this.

EDIT: I think I should've elaborated a bit more. Usually, you'd think a word would be something surrounded by " " but, for example, if the file contains a hyphen (which is also surrounded by a blank space), you'd want the isAWord() method to return false. How can I verify that something is actually a word and not punctuation?

Since the question wasn't entirely clear, I wrote two methods. The first method, consistsOfLetters, goes through the whole string and returns false if it contains any digits or symbols. This should be enough to decide whether a token is a word (if you don't care whether that word actually exists in a dictionary).

public static boolean consistsOfLetters(String string) {
    if (string.isEmpty()) return false; // an empty token is not a word
    String lower = string.toLowerCase(); // lower-case once, so we only compare against 'a'..'z'
    for (int i = 0; i < lower.length(); i++) {
        char c = lower.charAt(i);
        // a trailing '.' still counts as a word, unless the token is just "."
        if (c == '.' && i + 1 == lower.length() && lower.length() != 1) break;
        if (c < 'a' || c > 'z') return false;
    }
    return true;
}
        

The second method splits the original String (for example, a sentence of potential words) on the " " character, then goes through every element and checks whether it is a word. If any element is not a word, it returns false immediately and skips the rest. If every element passes, it returns true.

public static boolean isThisAWord(String string) {
    // split on single spaces and require every piece to be a word
    for (String token : string.split(" ")) {
        if (!consistsOfLetters(token)) return false;
    }
    return true;
}

Also, this will not work for all English text, since English has apostrophes in words like "don't", so a bit of further tinkering is needed.
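One possible direction for that tinkering, as a rough sketch: accept an apostrophe only when it sits between two letters. The class name ApostropheWords and the method isWordWithApostrophe are made up for this example and are not part of the answer above. It assumes ASCII letters and the plain ' character only, and unlike consistsOfLetters it does not accept a trailing full stop.

```java
public class ApostropheWords {

    // Hypothetical helper: true if the token consists of ASCII letters,
    // optionally with apostrophes that each sit between two letters.
    public static boolean isWordWithApostrophe(String token) {
        if (token.isEmpty()) return false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (isAsciiLetter(c)) continue;
            boolean inside = i > 0 && i + 1 < token.length();
            if (c == '\'' && inside
                    && isAsciiLetter(token.charAt(i - 1))
                    && isAsciiLetter(token.charAt(i + 1))) continue;
            return false; // digit, symbol, or a misplaced apostrophe
        }
        return true;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        for (String t : new String[] {"don't", "friend", "-", "'tis"}) {
            System.out.println(t + " -> " + isWordWithApostrophe(t));
        }
    }
}
```

With this rule "don't" and "friend" are accepted, while a lone "-" and a token that starts with an apostrophe are rejected.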

Java's Scanner splits its input using its WHITESPACE_PATTERN by default, so scanning a string like "He's my friend" yields the tokens "He's", "my", "friend". If that is sufficient, just remove that if clause and don't use the method. If you want "He" and "is" instead of "He's", you need a different approach.
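To see the default Scanner behaviour for yourself, here is a small sketch. The class name ScannerTokens and the helper tokens are invented for this example; the Scanner reads from a String here rather than from a file, just to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {

    // Collects the tokens Scanner produces with its default (whitespace) delimiter.
    public static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("He's my friend")); // prints [He's, my, friend]
    }
}
```

Note that a standalone hyphen surrounded by spaces comes out as its own token, which is exactly the case an isAWord() check has to reject.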

In short: the method works as a verification check. If the given token is not supposed to be in the result, return false; otherwise return true.

return token.matches("[\\pL\\pM]+('(s|t))?"); // letters, then optionally 's or 't (as in "let's", "doesn't")

matches requires the entire string to match.

This accepts letters (\pL) and combining diacritical marks (\pM, such as accents). The optional suffix allows an English apostrophe, in case you want to treat doesn't and let's as one term (for instance, for translation purposes). You might also want to allow hyphens.

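Keep in mind that Unicode has several single-quote and dash characters (for example the ASCII apostrophe U+0027 and the typographic ’ U+2019, or the ASCII hyphen-minus U+002D and the hyphen ‐ U+2010), so the pattern may need to accept more than one of each.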
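As a sketch of what that could look like, here is a variation of the one-line pattern that also accepts the typographic apostrophe U+2019 and hyphenated compounds. The class name UnicodeWords, the method isWord, and the choice of exactly these two quote and two hyphen characters are assumptions for illustration; real text may contain others.

```java
import java.util.regex.Pattern;

public class UnicodeWords {

    // letters/marks, optional hyphen-joined parts, optional 's or 't suffix
    private static final Pattern WORD = Pattern.compile(
            "[\\pL\\pM]+([-\u2010][\\pL\\pM]+)*(['\u2019](s|t))?");

    // matches() only succeeds if the whole token fits the pattern.
    public static boolean isWord(String token) {
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"doesn't", "doesn\u2019t", "well-known", "caf\u00e9", "-", "abc123"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWord(s));
        }
    }
}
```

The first four samples are accepted; the lone hyphen and the token containing digits are not.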

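import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Path path = Paths.get("..../x.txt");
Charset charset = Charset.defaultCharset();
String content = Files.readString(path, charset); // Java 11+; throws IOException
Pattern wordPattern = Pattern.compile("[\\pL\\pM]+"); // one or more letters/combining marks
Matcher m = wordPattern.matcher(content);
while (m.find()) { // find() scans for the next match anywhere in the text
    String word = m.group(); // ...
}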
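To end up with the String array the title asks about, the matches can be collected in a list and converted at the end. This is only a sketch: the class name WordArray and the method words are invented here, and a string literal stands in for the file content so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordArray {

    private static final Pattern WORD = Pattern.compile("[\\pL\\pM]+");

    // Returns every run of letters/combining marks in the text, in order.
    public static String[] words(String content) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(content);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The literal replaces the file content (an assumption for this demo).
        String[] result = words("He's my friend - since 2010.");
        System.out.println(String.join(",", result)); // prints He,s,my,friend,since
    }
}
```

With this simple pattern the apostrophe splits "He's" into "He" and "s", and the hyphen and the number are skipped entirely; use a pattern with an apostrophe suffix if contractions should stay in one piece.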


 