I am writing a program to count the total number of valid English words in a text file. I want to ignore words that contain digits or special characters, e.g. "word123", "123word", "word&&", "$name". Currently my program rejects words that start with a number, e.g. "123number", but it does not reject "number123". Can anyone tell me how to move forward? Below is my code:
public int wordCounter(String filePath) throws FileNotFoundException {
    File f = new File(filePath);
    Scanner scanner = new Scanner(f);
    int nonWord = 0;
    int count = 0;
    String regex = "[a-zA-Z].*";
    while (scanner.hasNext()) {
        String word = scanner.next();
        if (word.matches(regex)) {
            count++;
        }
        else {
            nonWord++;
        }
    }
    return count;
}
Lose the dot:
String regex = "[a-zA-Z]*"; // more correctly "[a-zA-Z]+", but both will work here
The dot means "any character", but you want a regex that means "only composed of letters".
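To make the difference concrete, here is a minimal sketch that runs the question's sample tokens through both patterns. The class and method names (DotVersusLettersDemo, matchesOriginal, matchesFixed) are made up for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class DotVersusLettersDemo {
    // The pattern from the question: one ASCII letter, then any characters at all.
    static final Pattern LETTER_THEN_ANYTHING = Pattern.compile("[a-zA-Z].*");
    // The corrected pattern: one or more ASCII letters and nothing else.
    static final Pattern LETTERS_ONLY = Pattern.compile("[a-zA-Z]+");

    static boolean matchesOriginal(String token) {
        // matches() requires the whole token to match the pattern
        return LETTER_THEN_ANYTHING.matcher(token).matches();
    }

    static boolean matchesFixed(String token) {
        return LETTERS_ONLY.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"word", "number123", "123number", "word&&", "$name"};
        for (String token : tokens) {
            System.out.println(token
                    + " original=" + matchesOriginal(token)
                    + " fixed=" + matchesFixed(token));
        }
    }
}
```

With the original pattern, "number123" and "word&&" are accepted because ".*" swallows everything after the first letter; with the letters-only pattern, only "word" passes.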
BTW, you can also express this more succinctly (although arguably less readably) using a Unicode property class:
String regex = "\\p{L}+";
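The regex \p{L} (written "\\p{L}" in a Java string literal) means "any Unicode letter", so unlike [a-zA-Z] it also accepts letters outside the ASCII range, such as é or ß.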
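A small sketch of how the two character classes differ on non-ASCII input. The class and method names are hypothetical, and the accented letter is written as a Unicode escape so that the result does not depend on the source file's encoding.

```java
// Hypothetical demo class; the names are illustrative only.
class UnicodeLetterDemo {
    // ASCII letters only
    static boolean isAsciiWord(String token) {
        return token.matches("[a-zA-Z]+");
    }

    // Any Unicode letter (general category L)
    static boolean isUnicodeWord(String token) {
        return token.matches("\\p{L}+");
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // "café" with a precomposed e-acute (U+00E9)
        System.out.println("ascii=" + isAsciiWord(cafe) + " unicode=" + isUnicodeWord(cafe));
        System.out.println("ascii=" + isAsciiWord("word123") + " unicode=" + isUnicodeWord("word123"));
    }
}
```

If the file is known to contain only plain English text, the two patterns behave the same; \p{L} only matters once accented or non-Latin letters show up.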
To extend the expression to allow an apostrophe, which can appear at the start (e.g. 'tis), in the middle (e.g. can't) or at the end (e.g. Jesus'), but not more than once:
String regex = "(?!([^']*'){2})['\\p{L}]+";
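As a quick check of the apostrophe pattern, here is a hedged sketch that applies it to a few tokens. The class name ApostropheDemo and the method isWord are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ApostropheDemo {
    // Negative lookahead rejects any token containing two apostrophes;
    // the rest requires one or more letters or apostrophes.
    static final Pattern WORD = Pattern.compile("(?!([^']*'){2})['\\p{L}]+");

    static boolean isWord(String token) {
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"'tis", "can't", "Jesus'", "'quoted'", "word123", "'"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isWord(token));
        }
    }
}
```

One thing to be aware of: a lone apostrophe also satisfies this pattern, so if stray quote marks can appear as separate tokens in the input, they would be counted as words unless an extra check for at least one letter is added.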
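Use the regex ^[a-zA-Z-]+$ to match a word. It accepts only ASCII letters and hyphens, so hyphenated words such as "well-known" are counted too. The ^ and $ anchors are optional here, because String.matches() always tests the whole string.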
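import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public int wordCounter(String filePath) throws FileNotFoundException
{
    int count = 0;
    String regex = "^[a-zA-Z-]+$";
    // try-with-resources closes the Scanner (and the underlying file) on exit
    try (Scanner scanner = new Scanner(new File(filePath))) {
        while (scanner.hasNext()) {
            // next() returns one whitespace-delimited token
            String word = scanner.next();
            if (word.matches(regex)) {
                count++;
            }
        }
    }
    return count;
}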
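To try the counting logic without creating a file, the same loop can be run over an in-memory string, since Scanner also accepts a String source. This is only a sketch: the class HyphenWordCounter and the method countWords are hypothetical names, not part of the answer above.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class HyphenWordCounter {
    // Same character class as the answer: ASCII letters and hyphens only.
    static final Pattern WORD = Pattern.compile("[a-zA-Z-]+");

    // Counts whitespace-delimited tokens that consist solely of letters and hyphens.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                if (WORD.matcher(scanner.next()).matches()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("word word123 123word word&& $name well-known plain"));
    }
}
```

For the sample line in main, the tokens "word", "well-known" and "plain" pass, so the count is 3. Two caveats follow from the pattern itself: a token made only of hyphens (such as "--") is counted as a word, and a word with trailing punctuation (such as "end.") is not, so real text may need the punctuation stripped first.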