
Joint segmentation of the sentence using NLP tools

I am looking for an example of joint segmentation (I'm not sure that is the correct term) using advanced NLP tools. For example, if I have a sentence like this:

I like cats and/or dogs.

And I would like to have something like this:

1. I like cats.
2. I like dogs.

Is there a way to do this?

Or maybe a more complicated example. If I have:

The steering system shall ensure easy and safe handling of the vehicle up to its maximum design speed
or in case of a trailer up to its technically permitted maximum speed.

Is it possible to get something like:

1. The steering system shall ensure easy and safe handling of the vehicle up to its maximum design speed.
2. In case of a trailer, the steering system shall ensure easy and safe handling of the vehicle up to its technically permitted maximum speed.

There are going to be many long sentences like that, so I guess regular expressions are not the best solution. Or are they? I cannot find much information about this topic.

I think the right tool for this kind of segmentation is a syntactic dependency tree. For your first example sentence, the dependency tree looks like this (you can reproduce it in a Colab notebook):

# In a notebook, download the model first:
# !python -m spacy download en_core_web_lg
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_lg')
displacy.render(nlp("I like cats and dogs."), style="dep", jupyter=True)


The "conj" arc stands for "conjunct": it links phrases that are joined by a coordinating word such as "and", "or", or "but". In this example, the word "dogs" is attached to the word "cats" as a conjunct. You can therefore traverse the tree, look for "conj" dependencies, and keep only one of the coordinated variants at a time.
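
The traversal idea can be tried without running a parser. The sketch below uses a hypothetical `Tok` class as a stand-in for spaCy tokens (only `text`, `dep_`, and `children` are mimicked), and the dependency labels for the first sentence are filled in by hand as an assumption about what the parser returns:

```python
from dataclasses import dataclass, field


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token: text, dependency label, children."""
    text: str
    dep_: str
    children: list = field(default_factory=list)


def conj_pairs(node):
    """Yield (head, conjunct) text pairs for every 'conj' arc at or below node."""
    for child in node.children:
        if child.dep_ == "conj":
            yield (node.text, child.text)
        yield from conj_pairs(child)


# Hand-annotated tree for "I like cats and dogs." (labels assumed, not parser output)
dogs = Tok("dogs", "conj")
cats = Tok("cats", "dobj", [Tok("and", "cc"), dogs])
like = Tok("like", "ROOT", [Tok("I", "nsubj"), cats, Tok(".", "punct")])

pairs = list(conj_pairs(like))
print(pairs)  # [('cats', 'dogs')]
```

Each pair marks a place where the sentence can be split into one variant per conjunct.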

The code for this could look like:

import itertools

def generate_trees(root):
    """
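    Yield all variants of the subtree rooted at the given node, keeping one conjunct at a time.
    A subtree here is just a list of nodes.
    """
    prev_result = [root]
    # Token.children is a generator and therefore always truthy, so materialize it before testing
    if not list(root.children):
        yield prev_result
        return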

    children_deps = {c.dep_ for c in root.children}
    if 'conj' in children_deps:
        # generate two options: subtree without cc+conj, or with conj child replacing the root
        # the first option:
        good_children = [c for c in root.children if c.dep_ not in {'cc', 'conj'}]
        for subtree in combine_children(prev_result, good_children):
            yield subtree 
        # the second option
        for child in root.children:
            if child.dep_ == 'conj':
                for subtree in generate_trees(child):
                    yield subtree
    else:
        # otherwise, just combine all the children subtrees
        for subtree in combine_children([root], root.children):
            yield subtree

def combine_children(prev_result, children):
    """ Combine the parent subtree with all variants of the children subtrees """
    child_lists = []
    for child in children:
        child_lists.append(list(generate_trees(child)))
    for prod in itertools.product(*child_lists):  # all possible combinations
        yield prev_result + [tok for parts in prod for tok in parts]

Applied to the easy example, this code works as expected:

text = 'I like cats and dogs.'
doc = nlp(text)
sentence = list(doc.sents)[0]
for tree in generate_trees(sentence.root):
    print(' '.join([token.text for token in sorted(tree, key=lambda x: x.i)]))
# I like cats .
# I like dogs .
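
The printed variants have a space before the final period because the tokens are joined with single spaces. A small post-processing helper could tidy that up. The name `detokenize` is hypothetical (it is not part of spaCy), and this rough sketch only handles common closing punctuation:

```python
import re


def detokenize(words):
    """Join token texts with spaces, then drop the space before closing punctuation."""
    text = " ".join(words)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


print(detokenize(["I", "like", "cats", "."]))  # I like cats.
```

In the loop above, it would be called on the list of `token.text` values in place of `' '.join(...)`.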

The second example indeed proves more difficult:

text = 'The steering system shall ensure easy and safe handling of the vehicle up to its maximum design speed or in case of a trailer up to its technically permitted maximum speed.'
doc = nlp(text)
sentence = list(doc.sents)[0]
for tree in generate_trees(sentence.root):
    print(' '.join([token.text for token in sorted(tree, key=lambda x: x.i)]))
# The steering system shall ensure easy handling of the vehicle up to its maximum design speed .
# The steering system shall ensure easy handling of the vehicle up in case of a trailer up to its technically permitted maximum speed .
# The steering system shall ensure safe handling of the vehicle up to its maximum design speed .
# The steering system shall ensure safe handling of the vehicle up in case of a trailer up to its technically permitted maximum speed .

As you can see, my code has generated four variants instead of two. Technically this is correct, because "easy and safe" is itself a conjunction of two independent predicates. The more troubling problem is the strange "up in case" construction, which results from the parser erroneously attaching "or" to "to" instead of "up". Another parser (not spaCy's) might do better, but I wouldn't expect too much from it, because the sentence really is a difficult one.
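
The count of four follows from how `combine_children` uses `itertools.product`: each coordinated slot contributes its variants independently, so two slots with two variants each give 2 × 2 sentences. A minimal illustration, with the slot contents copied by hand from the output above rather than produced by the parser:

```python
import itertools

# Variants of each coordinated slot, written out by hand (not parser output)
modifiers = ["easy", "safe"]
limits = [
    "to its maximum design speed",
    "in case of a trailer up to its technically permitted maximum speed",
]

# One sentence per combination of choices, in the same order as the output above
variants = [
    f"The steering system shall ensure {mod} handling of the vehicle up {lim}."
    for mod, lim in itertools.product(modifiers, limits)
]

print(len(variants))  # 4
for sentence in variants:
    print(sentence)
```

If only the "or" split is wanted, one option would be to skip the expansion for conjuncts that are adjectival modifiers, at the cost of treating "easy and safe" as a single unit.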

