Split string into list of substrings of different character types
I am writing a spell checker that takes a text file as input and outputs the file with the spelling corrected.
The program should preserve formatting and punctuation.
I want to split the input text into a list of string tokens such that each token consists of one or more characters of a single kind: word, punctuation, whitespace, or digit characters.
For example:
Input:
words.txt:
asdf don't ]'.'..;'' as12....asdf.
asdf
Input as list:
["asdf", " ", "don't", " ", "]'.'..;''", " ", "as", "12", "....", "asdf", ".", "\n", "asdf"]
Words like won't and i'll should be treated as a single token.
Having the data in this format would allow me to process the tokens like so:
    String output = "";
    for (String token : tokens) {
        if (isWord(token)) {
            if (!inDictionary(token)) {
                token = correctSpelling(token);
            }
        }
        output += token;
    }
So my main question is: how can I split a string of text into a list of substrings as described above? Thank you.
The main difficulty here is finding a regex that matches what you consider to be a "word". In my example I treat ' as part of a word if it is preceded by a letter or if the following character is a letter:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Place this method inside a class. inDictionary() and correctSpelling()
    // are the helper methods from the question.
    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        // The pattern: a letter followed by letters or apostrophes,
        // or an apostrophe followed by one or more letters
        Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");
        Matcher m = p.matcher(in);
        // If you want to collect the words
        List<String> words = new ArrayList<>();
        StringBuilder result = new StringBuilder();
        // Now find something from the start
        int pos = 0;
        while (m.find(pos)) {
            // Add everything from the starting position to the beginning of the word
            result.append(in.substring(pos, m.start()));
            // Handle dictionary logic
            String token = m.group();
            words.add(token); // not actually used
            if (!inDictionary(token)) {
                token = correctSpelling(token);
            }
            // Add to result
            result.append(token);
            // Repeat from the end position
            pos = m.end();
        }
        // Append the remainder of the input
        result.append(in.substring(pos));
        System.out.println("Result: " + result.toString());
    }
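The code above rewrites the text in place and never builds the token list the question asks for. If you do want that list, the same idea can be stretched to a single pattern with one alternative per token kind. This is only a sketch under my own assumptions: the class name `RegexTokenizer` and the method `tokenize` are made up for illustration, and unlike the pattern above it treats an apostrophe as part of a word only when it has letters on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // One alternative per token kind, tried in this order at each position.
    // Together they cover every character, so nothing is skipped.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Alpha}+(?:'\\p{Alpha}+)*"   // word, apostrophes allowed between letters
            + "|\\d+"                        // run of digits
            + "|\\s+"                        // run of whitespace
            + "|[^\\p{Alpha}\\d\\s]+");      // run of anything else (punctuation)

    // Hypothetical helper: returns the tokens in order of appearance.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("asdf don't ]'.'..;'' as12....asdf.\nasdf");
        for (String token : tokens) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every character falls into exactly one of the four alternatives, joining the tokens back together reproduces the original text, which is what lets formatting and punctuation survive the spell check.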
Because I like solving puzzles, I tried the following and I think it works fine:
    public class MyTokenizer {
        private final String str;
        private int pos = 0;

        public MyTokenizer(String str) {
            this.str = str;
        }

        public boolean hasNext() {
            return pos < str.length();
        }

        // Returns the next run of characters that share the same type.
        public String next() {
            int type = getType(str.charAt(pos));
            StringBuilder sb = new StringBuilder();
            // Note: an apostrophe is appended whatever the type of the current token
            while (hasNext() && (str.charAt(pos) == '\'' || type == getType(str.charAt(pos)))) {
                sb.append(str.charAt(pos));
                pos++;
            }
            return sb.toString();
        }

        // Character types: 0 = digit, 1 = word, 2 = whitespace, 3 = punctuation, 4 = other.
        // Digits are tested first because \w also matches them.
        private int getType(char c) {
            String sc = Character.toString(c);
            if (sc.matches("\\d")) {
                return 0;
            }
            else if (sc.matches("\\w")) {
                return 1;
            }
            else if (sc.matches("\\s")) {
                return 2;
            }
            else if (sc.matches("\\p{Punct}")) {
                return 3;
            }
            else {
                return 4;
            }
        }

        public static void main(String... args) {
            MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
            while (mt.hasNext()) {
                System.out.println(mt.next());
            }
        }
    }
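One thing to watch in the loop condition of `next()`: an apostrophe is appended whatever the type of the current token, so a quote that follows a space or a number gets glued onto that token. Below is a hedged sketch of a variant that absorbs an apostrophe only when the current token is a word and a letter comes right after it. The class name `CharClassTokenizer` and the `Type` enum are my own inventions for illustration. It also classifies characters with the `Character` methods instead of regexes, which is an assumption on my part and changes the behaviour slightly (for example, an underscore counts as "other" here, while `\w` treats it as a word character).

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassTokenizer {
    private enum Type { DIGIT, WORD, SPACE, OTHER }

    private static Type typeOf(char c) {
        if (Character.isDigit(c)) return Type.DIGIT;
        if (Character.isLetter(c)) return Type.WORD;
        if (Character.isWhitespace(c)) return Type.SPACE;
        return Type.OTHER;
    }

    // Hypothetical helper: splits the input into runs of same-typed characters.
    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int start = pos;
            Type type = typeOf(s.charAt(pos));
            pos++;
            while (pos < s.length()) {
                char c = s.charAt(pos);
                boolean sameType = typeOf(c) == type;
                // An apostrophe joins a word only when a letter follows it.
                boolean innerApostrophe = c == '\'' && type == Type.WORD
                        && pos + 1 < s.length()
                        && Character.isLetter(s.charAt(pos + 1));
                if (!sameType && !innerApostrophe) {
                    break;
                }
                pos++;
            }
            tokens.add(s.substring(start, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("it's 'quoted' 42")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With this rule, won't and i'll still come out as single tokens, while quotation marks around a word stay separate from both the word and the surrounding whitespace.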