
Sentence parsing with regular expressions, including bullet lists, in Java

Currently, I use the following regular expression to parse sentences in a document:

Pattern.compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)");

This almost works. For example, given this string:

"Mary had a little lamb (i.e. lamby pie). Here are its properties: 1. It has four feet 2. It has fleece 3. It is a mammal. It had white fleese. Her father, Mr. Lamb, lives on Mulbery St. in a little white house."

I get the following sentences:

Mary had a little lamb (i.e. lamby pie).
Here are its properties: 
1. It has four feet  2. It has fleece 3. It is a mammal. 
It had white fleese. 
Her father, Mr. Lamb, lives on Mulbery St. in a little white house.

However, what I would like is:

Mary had a little lamb (i.e. lamby pie).
Here are its properties: 
1. It has four feet  
2. It has fleece 
3. It is a mammal. 
It had white fleese. 
Her father, Mr. Lamb, lives on Mulbery St. in a little white house.

Is there any way to do this by altering the existing regular expression?

Right now, to accomplish this, I first do an initial split and then check each piece for bullets. The following code works, but I'm wondering if there is a more elegant solution:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void doHomeMadeSentenceParser(String temp) {
    // First pass: split after sentence-ending punctuation, skipping known abbreviations.
    Pattern p = Pattern
            .compile("(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)");
    String[] sentences = p.split(temp);
    List<String> psentences = new ArrayList<>();
    // Second pass: a bullet is a number followed by '.' or ')' and whitespace.
    Pattern p1 = Pattern.compile("\\b\\d+[.)]\\s");
    for (String sentence : sentences) {
        Matcher matcher = p1.matcher(sentence);
        int bstart = 0;
        boolean bulletfound = false;
        while (matcher.find()) {
            bulletfound = true;
            // Text between the start of the previous bullet and this one.
            String bullet = sentence.substring(bstart, matcher.start());
            if (bullet.length() > 0) {
                psentences.add(bullet);
            }
            bstart = matcher.start();
        }
        // Add the last bullet, or the whole piece if it contains no bullet.
        if (bulletfound) {
            psentences.add(sentence.substring(bstart));
        } else {
            psentences.add(sentence);
        }
    }
    for (String s : psentences) {
        System.out.println(s.trim());
    }
}

Thanks in advance for any help.

Elliott

I'm assuming that you are using the regex to find where to split the lines. I don't know the exact regex for this, but could you add a lookahead for a number followed by a period (.)?
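
As a rough sketch of that idea (not a tested, definitive answer), a second alternative could be appended to the question's pattern: a zero-width position that follows whitespace and is followed by digits, a '.' or ')', and whitespace. The class name BulletAwareSplitter, the method splitSentences, and the BEFORE_BULLET alternative are my own assumptions for illustration; only the first pattern comes from the question. Note that it shares a limitation with the two-pass code in the question: a sentence that simply ends in a number (say, a year followed by a period) would also be treated as a bullet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class BulletAwareSplitter {

    // The sentence-boundary pattern from the question, unchanged.
    private static final String SENTENCE_END =
        "(?<=\\w[\\w\\)\\]](?<!Mrs?|Dr|Rev|Mr|Ms|vs|abd|ABD|Abd|resp|St|wt)[\\.\\?\\!\\:\\@]\\s)";

    // Assumed extra alternative: a zero-width position preceded by whitespace
    // and followed by digits, then '.' or ')', then whitespace.
    // Using (?<=\s) keeps multi-digit bullets such as "12." in one piece.
    private static final String BEFORE_BULLET = "(?<=\\s)(?=\\d+[.)]\\s)";

    private static final Pattern SPLITTER =
        Pattern.compile(SENTENCE_END + "|" + BEFORE_BULLET);

    // Splits the text at sentence ends and before bullets, trimming each piece.
    static List<String> splitSentences(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : SPLITTER.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

Calling BulletAwareSplitter.splitSentences(temp) on the sample string from the question should give the seven lines asked for, with each numbered item on its own line. Both alternatives are zero-width, so a position matched by both (such as right after "properties: ") still produces only one split.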
