How do I break a string into sections of n word groupings using regex and split?

I'm working on part of a plagiarism detection program and need to split a string into groups of words using a regex and the split method.

Let's say we have the following string and want to break it into pieces of three words. In this case split(regex) should split the sentence after each third whitespace.

Sample data: "It is a long established fact that"
Sample output: "It is a", "long established fact"

Here is a simplified version of the code including the part I'm working on. I managed to split after every two words but couldn't do it for n=3.

// Renamed from "String": a class with that name would shadow java.lang.String.
public class WordSplitter {
    public void splitter(String string) {
        // string: "It is a long established fact that"
        String[] splitString = string.split("(?<!\\G\\S+)\\s");
    }
}

Output for the code above is as follows:

splitString[0] = "It is"
splitString[1] = "a long"
splitString[2] = "established fact"

Then I came up with the regex "(?<=\\G\\s{2})\\s", meaning "match every whitespace character that has two other whitespace characters before it", and expected the output to be "It is a", "long established fact", but the array was empty.

Here is another regex I just built: "(?<=(^|\\G)\\S*\\s\\S*\\s\\S*)\\s". It almost does the job. The only problem is that the last group can contain fewer than n words when the total number of words in the sentence is not divisible by n: splitString[3] = "that"

You can't do it with the split function. The split method splits the string around matches of the given regular expression, and here you don't have a condition for the separator; you have a condition for the string between the separators, "\\S\\s+\\S\\s+\\S". Your approach is wrong.

If you need to use a regex for that, use the Pattern and Matcher classes.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    public static void main(String[] args) {
        // Three words, or a final group of one or two words at the end of the input.
        Pattern p = Pattern.compile("\\S+\\s+\\S+\\s+\\S+\\s*|\\S+\\s*$|\\S+\\s+\\S+\\s*$");
        Matcher m = p.matcher("It is a long established fact that");
        // find() returns false once there are no more matches,
        // so there is no need to catch IllegalStateException.
        while (m.find()) {
            String palabras = m.group();
            System.out.println(palabras);
        }
    }
}

Output:

It is a 
long established fact
that

A generic regex that follows the same pattern for n words, where 3 is your n and 2 is your n-1:

"(\\S+\\s+){3}\\s*|(\\S+\\s*){1,2}$"

Substituting it into the previous code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestPalabra {

    public static void main(String[] args) {
         int n = 3;
         String regexPat = String.format("(\\S+\\s+){%d}\\s*|(\\S+\\s*){1,%d}$", n,n-1);

         Pattern p = Pattern.compile(regexPat);
         Matcher m = p.matcher("It is a long established fact that is ");
         // find() returns false once there are no more matches,
         // so there is no need to catch IllegalStateException.
         while (m.find()) {
             String palabras = m.group();
             System.out.println(palabras);
         }

    }

}
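One thing to watch with the pattern above: its first alternative requires whitespace after each of the n words, so a final group of exactly n words with no trailing whitespace may not be matched as a single group. A minimal sketch of a variant that avoids this, assuming you want each group without trailing whitespace, is shown here. The class name WordGrouper and the method groupWords are made up for this illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordGrouper {

    // Hypothetical helper: returns groups of up to n whitespace-separated
    // words. Each group keeps its inner whitespace but has none at the ends.
    static List<String> groupWords(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // One word, then greedily up to n-1 more "whitespace + word" pairs.
        Pattern p = Pattern.compile("\\S+(?:\\s+\\S+){0," + (n - 1) + "}");
        Matcher m = p.matcher(text);
        List<String> groups = new ArrayList<>();
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        for (String group : groupWords("It is a long established fact that", 3)) {
            System.out.println(group);
        }
    }
}
```

Because the quantifier is greedy and nothing follows it in the pattern, each match takes as many whole words as it can, up to n, and the last match simply takes whatever is left.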

Change it if you want. Here is just an example of how you can break a String sentence into separate words, which are stored in an ArrayList.

import java.util.ArrayList;
import java.util.List;

public class Main {
    private static String word = "";
    private static int readerStoppedAt = 0;
    private static int h;
    private static List<String> strings = new ArrayList<>();

    public static void main(String[] args) {
        String text;

        // Try one of these strings:
        // text = "Hello there, it's just a casual test!";
        // text = "It works!!%%%!!||#@|''#'@ Symbols are not a problem";
        text = "Good game, well played! Future changes are not necessary!";

        String text2 = removeSymbols(text);
        for (int i = 0; i < text2.length(); i++) {
            h = i; // h = where the main loop currently is
            char c = text2.charAt(i);
            if (c == ' ') {
                check(text2, 0);
            } else if (i + 1 == text2.length()) {
                check(text2, 1);
            }
        }
        for (String string : strings) {
            System.out.println(string);
        }
    }

    private static String removeSymbols(String text) {
        // You can add more symbols if you want
        text = text.replace("!", "");
        text = text.replace(",", "");
        text = text.replace("@", "");
        text = text.replace("#", "");
        text = text.replace("|", "");
        text = text.replace("''", "");
        text = text.replace("'", "");
        text = text.replace("%", "");
        text = text.replace(":", "");
        text = text.replace("(", "");
        text = text.replace(")", "");
        text = text.replace("{", "");
        text = text.replace("}", "");
        return text;
    }

    // Collects the non-space characters between the last stop and the
    // current position into one word and stores it in the list.
    private static void check(String text, int inc) {
        for (int j = readerStoppedAt; j < h + inc; j++) {
            char temp = text.charAt(j);
            if (temp != ' ') {
                word = word + temp;
            }
        }
        readerStoppedAt = h;
        strings.add(word);
        word = "";
    }

}
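Once the sentence is broken into words like this, grouping them n at a time does not need a regex beyond the whitespace split. The following is only a sketch under that assumption: it uses String.split("\\s+") in place of the character loop above, and the names WordChunker and chunkWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordChunker {

    // Hypothetical helper: splits the text on runs of whitespace, then joins
    // every n consecutive words with a single space.
    static List<String> chunkWords(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // no words at all
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += n) {
            int end = Math.min(i + n, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Good game well played Future changes are not necessary";
        for (String chunk : chunkWords(text, 3)) {
            System.out.println(chunk);
        }
    }
}
```

Unlike the regex approaches, this normalizes the whitespace inside each group to single spaces, which may or may not matter for comparing text in a plagiarism check.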
