
Stanford CoreNLP: find homogeneous parts of a sentence

I'm trying to build a sentence simplification algorithm based on Stanford CoreNLP. One of the simplifications I want to do is to transform a sentence with homogeneous parts (coordinated words that share the same role) into several sentences. E.g.

I love my mom, dad and sister. -> I love my mom. I love my dad. I love my sister.

First of all, I build the semantic graph for the input sentence string

    final Sentence parsed = new Sentence(sentence);
    final SemanticGraph dependencies = parsed.dependencyGraph();

The dependency graph for this sentence is

-> love/VBP (root)
  -> I/PRP (nsubj)
  -> mom/NN (dobj)
    -> my/PRP$ (nmod:poss)
    -> ,/, (punct)
    -> dad/NN (conj:and)
    -> and/CC (cc)
    -> sister/NN (conj:and)
  -> dad/NN (dobj)
  -> sister/NN (dobj)

Then I collect the dobj edges and the nsubj edge from the graph

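    // All direct-object edges, plus the subject edge of the sentence.
    final List<SemanticGraphEdge> modifiers = new ArrayList<>();
    SemanticGraphEdge subj = null;
    for (SemanticGraphEdge edge : dependencies.edgeListSorted()) {
        final String relation = edge.getRelation().getShortName();
        if (relation.startsWith("dobj")) {
            modifiers.add(edge);
        } else if (relation.startsWith("nsubj")) {
            subj = edge;
        }
    }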

So now I have 3 edges in modifiers and the nsubj edge pointing to the word I. My problem now is how to split the semantic graph into 3 separate graphs. Of course, the naive solution is just to build a sentence from subj and the governor/dependent of each dobj edge, but I understand that it's a bad idea and won't work on more complicated examples.

    for (final SemanticGraphEdge edge : modifiers) {
        SemanticGraph semanticGraph = dependencies.makeSoftCopy(); // unused in this naive version
        final IndexedWord governor = edge.getGovernor();
        final IndexedWord dependent = edge.getDependent();

        // Only handle objects that hang off a verb.
        final String governorTag = governor.backingLabel().tag().toLowerCase();
        if (governorTag.startsWith("vb")) {
            // subject + verb + object, e.g. "I love mom. "
            StringBuilder b = new StringBuilder(subj.getDependent().word());
            b.append(" ")
                    .append(governor.word())
                    .append(" ")
                    .append(dependent.word())
                    .append(". ");
            System.out.println(b);
        }
    }

Can anyone give me some advice? Maybe I missed something useful in the CoreNLP documentation? Thanks.

Thanks to @JosepValls for the great idea. Here are some code samples showing how I simplify sentences with 3 or more homogeneous words.

First of all, I defined several regexps over POS tags for cases such as

jj(optional) nn, jj(optional) nn, jj(optional) nn and jj(optional) nn
jj(optional) nn, jj(optional) nn, jj(optional) nn , jj(optional) nn ...
jj , jj , jj
jj , jj and jj
vb nn(optional) , vb nn(optional) , vb nn(optional)
and so on.

The regexps are

    Pattern nounAdjPattern = Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");
    Pattern verbPattern = Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)");

These patterns are used to decide whether the input sentence contains a list of homogeneous words and to find its boundaries. After that I create a list of POS tags for the words of the original sentence, keeping only the first two characters of each tag in lower case

final Sentence parsed = new Sentence(sentence);
final List<String> words = parsed.words();
List<String> pos = parsed.posTags().stream()
        .map(tag -> tag.length() < 2 ? tag.toLowerCase() : tag.substring(0, 2).toLowerCase())
        .collect(Collectors.toList());

To match this POS structure against the regexps, join the list into a single string

String posString = pos.stream().collect(Collectors.joining(" "));
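
The mapping from a regex match back to word positions can be tried without CoreNLP. The following is a minimal sketch, assuming hard-coded Penn Treebank tags for the example sentence instead of the output of parsed.posTags(); the class and method names (PosSpanDemo, findSpan, countTokens) are hypothetical and only serve the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PosSpanDemo {
    // Same noun/adjective pattern as in the answer.
    static final Pattern NOUN_ADJ = Pattern.compile(
            "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");

    // Keep the first two characters of each tag, lower-cased, joined by spaces.
    static String toPosString(List<String> tags) {
        return tags.stream()
                .map(t -> t.length() < 2 ? t.toLowerCase() : t.substring(0, 2).toLowerCase())
                .collect(Collectors.joining(" "));
    }

    // Returns {firstWordIndex, endWordIndexExclusive} of the matched list, or null.
    static int[] findSpan(List<String> tags) {
        String posString = toPosString(tags);
        Matcher m = NOUN_ADJ.matcher(posString);
        if (!m.find()) {
            return null;
        }
        // One tag per word, so counting tags before the match gives a word index.
        int first = countTokens(posString.substring(0, m.start()));
        int length = countTokens(m.group());
        return new int[] {first, first + length};
    }

    static int countTokens(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Tags are hard-coded assumptions, not real tagger output.
        List<String> words = Arrays.asList("I", "love", "my", "mom", ",", "dad", "and", "sister", ".");
        List<String> tags = Arrays.asList("PRP", "VBP", "PRP$", "NN", ",", "NN", "CC", "NN", ".");
        int[] span = findSpan(tags);
        System.out.println(toPosString(tags));
        System.out.println(words.subList(span[0], span[1]));
    }
}
```

Because tags have different lengths once normalized ("," is one character, "nn" is two), the character offsets of the match cannot be used as word indices directly; counting whitespace-separated tags is what ties the match back to the words list.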

If the sentence doesn't match any regex, return the same string; otherwise, simplify it.

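    // matcher is assumed to come from one of the patterns above,
    // e.g. nounAdjPattern.matcher(posString)
    if (!matcher.find()) {
        return new SimplificationResult(Collections.singleton(sentence));
    }
    return new SimplificationResult(simplify(posString, matcher, words));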

In the simplify method I look for the boundaries of the homogeneous part and extract 3 parts from the words list: the beginning and the ending, which won't change, and the homogeneous part, which will be split into pieces. After splitting the homogeneous part into pieces, I build several simplified sentences of the form beginning + piece + ending.

    private Set<String> simplify(String posString, Matcher matcher, List<String> words) {
        // POS tags before and after the matched homogeneous part
        String startPOS = posString.substring(0, matcher.start());
        String endPOS = posString.substring(matcher.end());
        // One POS tag per word, so counting tags gives word counts
        int wordsBeforeCnt = StringUtils.isEmpty(startPOS) ? 0 : startPOS.trim().split("\\s+").length;
        int wordsAfterCnt = StringUtils.isEmpty(endPOS) ? 0 : endPOS.trim().split("\\s+").length;
        String wordsBefore = words.subList(0, wordsBeforeCnt)
                .stream()
                .collect(Collectors.joining(" "));
        String wordsAfter = words.subList(words.size() - wordsAfterCnt, words.size())
                .stream()
                .collect(Collectors.joining(" "));
        List<String> homogeneousPart = words.subList(wordsBeforeCnt, words.size() - wordsAfterCnt);
        Set<String> splitWords = new HashSet<>(Arrays.asList(",", "and"));
        // HashSet: the order of the resulting sentences is not preserved
        Set<String> simplifiedSentences = new HashSet<>();
        StringBuilder sb = new StringBuilder(wordsBefore);
        for (int i = 0; i < homogeneousPart.size(); i++) {
            String part = homogeneousPart.get(i);
            if (!splitWords.contains(part)) {
                sb.append(" ").append(part);
                if (i == homogeneousPart.size() - 1) {
                    // last piece: close the final sentence
                    sb.append(" ").append(wordsAfter).append(" ");
                    simplifiedSentences.add(sb.toString());
                }
            } else {
                // separator reached: close this sentence and start the next one
                sb.append(" ").append(wordsAfter).append(" ");
                simplifiedSentences.add(sb.toString());
                sb = new StringBuilder(wordsBefore);
            }
        }
        return simplifiedSentences;
    }
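
For experimenting with the splitting step on its own, here is a small variant rather than the code above: a sketch that takes the word indices of the homogeneous part directly, returns a List so that the sentence order is kept, and skips empty pieces (which would otherwise appear for ", and"). The names SplitSpanDemo, split and joinNonEmpty are hypothetical, and the separators are the same two as above ("," and "and").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SplitSpanDemo {
    private static final Set<String> SEPARATORS = new HashSet<>(Arrays.asList(",", "and"));

    // Builds one sentence per piece of words[from, to): prefix + piece + suffix.
    static List<String> split(List<String> words, int from, int to) {
        String prefix = String.join(" ", words.subList(0, from));
        String suffix = String.join(" ", words.subList(to, words.size()));
        List<String> result = new ArrayList<>();
        List<String> piece = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            boolean atEnd = (i == to);
            if (!atEnd && !SEPARATORS.contains(words.get(i))) {
                piece.add(words.get(i));
                continue;
            }
            // Separator or end of span: emit the collected piece, if any.
            if (!piece.isEmpty()) {
                result.add(joinNonEmpty(prefix, String.join(" ", piece), suffix));
                piece = new ArrayList<>();
            }
        }
        return result;
    }

    // Joins the parts with single spaces, ignoring empty ones.
    private static String joinNonEmpty(String... parts) {
        List<String> nonEmpty = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                nonEmpty.add(p);
            }
        }
        return String.join(" ", nonEmpty);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "love", "my", "mom", ",", "dad", "and", "sister", ".");
        // Words 3..7 ("mom , dad and sister") form the homogeneous part.
        split(words, 3, 8).forEach(System.out::println);
    }
}
```

With the indices 3 and 8 this prints "I love my mom .", "I love my dad ." and "I love my sister ." in that order.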

So, e.g., the sentence

 I love and kiss and adore my beautiful mom, clever dad and sister.

will be simplified into 9 sentences if we use the 2 regexps above

I adore my clever dad . 
I love my clever dad . 
I love my sister . 
I kiss my sister . 
I kiss my clever dad . 
I adore my sister . 
I love my beautiful mom . 
I adore my beautiful mom . 
I kiss my beautiful mom . 
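
How the two patterns are combined is not shown above. One way that reproduces the 9 sentences is to expand one matched list at a time and recurse on every resulting sentence until no pattern matches. The sketch below makes that assumption; it also assumes already normalized, hard-coded tags for the example sentence, and it splits on the separator tags ("," and "cc") rather than on the words. ChainedSimplifyDemo, expand, splice and tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedSimplifyDemo {
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))"),
            Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)"));

    // words and tags are parallel lists; tags are already normalized ("nn", "vb", ...).
    static List<String> expand(List<String> words, List<String> tags) {
        String posString = String.join(" ", tags);
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(posString);
            if (!m.find()) {
                continue;
            }
            int from = tokens(posString.substring(0, m.start()));
            int to = from + tokens(m.group());
            List<String> out = new ArrayList<>();
            int pieceStart = from;
            for (int i = from; i <= to; i++) {
                boolean boundary = i == to || tags.get(i).equals(",") || tags.get(i).equals("cc");
                if (!boundary) {
                    continue;
                }
                if (i > pieceStart) {
                    // Keep only this piece of the list and expand the shorter sentence.
                    out.addAll(expand(splice(words, from, to, pieceStart, i),
                                      splice(tags, from, to, pieceStart, i)));
                }
                pieceStart = i + 1;
            }
            return out;
        }
        // No pattern matches: the sentence is already simple.
        return new ArrayList<>(Arrays.asList(String.join(" ", words)));
    }

    // Copy of list in which [from, to) is replaced by its own sub-range [keepFrom, keepTo).
    static List<String> splice(List<String> list, int from, int to, int keepFrom, int keepTo) {
        List<String> r = new ArrayList<>(list.subList(0, from));
        r.addAll(list.subList(keepFrom, keepTo));
        r.addAll(list.subList(to, list.size()));
        return r;
    }

    static int tokens(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "love", "and", "kiss", "and", "adore", "my",
                "beautiful", "mom", ",", "clever", "dad", "and", "sister", ".");
        // Assumed tags, one per word.
        List<String> tags = Arrays.asList("pr", "vb", "cc", "vb", "cc", "vb", "pr",
                "jj", "nn", ",", "jj", "nn", "cc", "nn", ".");
        expand(words, tags).forEach(System.out::println);
    }
}
```

Carrying the tags along with the words avoids re-tagging every intermediate sentence, at the cost of trusting the tags of the original sentence.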

This code works only with 3 or more homogeneous words, because with 2 words there are lots of exceptions. E.g.

Cat eats mouse, dog eats meat.

Then the sentence can't be simplified this way.
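
The effect of the {2,} quantifier can be checked directly on POS strings. The sketch below rebuilds the noun/adjective pattern from two pieces so that the minimum number of list items becomes a parameter; the tags for the example are assumed, and TwoItemPitfallDemo, listOfAtLeast and matches are hypothetical names.

```java
import java.util.regex.Pattern;

final class TwoItemPitfallDemo {
    // The two building blocks of the noun/adjective pattern from the answer.
    static final String ITEM = "((jj)\\s(nn)|(jj)|(nn))";
    static final String SEP = "\\s((cc)|,)\\s";

    // minItems = 3 yields exactly the {2,} pattern used in the answer.
    static Pattern listOfAtLeast(int minItems) {
        return Pattern.compile("(" + ITEM + SEP + "){" + (minItems - 1) + ",}" + ITEM);
    }

    static boolean matches(int minItems, String posString) {
        return listOfAtLeast(minItems).matcher(posString).find();
    }

    public static void main(String[] args) {
        // Assumed tags for "Cat eats mouse , dog eats meat ."
        String twoClauses = "nn vb nn , nn vb nn .";
        // Loosened to 2 items: "mouse , dog" looks like a list, which is wrong.
        System.out.println(matches(2, twoClauses));
        // Pattern from the answer (3 or more items): no match, sentence is left alone.
        System.out.println(matches(3, twoClauses));
    }
}
```

With a minimum of 2 items, the span "mouse , dog" would be treated as a list and the sentence would be split into "Cat eats mouse eats meat ." and "Cat eats dog eats meat .", which is why the patterns require at least 3 items. The price is that genuine two-item lists such as "mom and dad" are left unsimplified as well.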
