
Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.

I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:

{
  abstract: "...long abstract here...",
  highlights: [
    {
      concept: 'a word',
      start: 1,
      end: 10
    },
    {
      concept: 'cancer',
      start: 123,
      end: 135
    }
  ]
}

I am looping over each highlight and locating its start index in the abstract (the end doesn't really matter, as I only need a location within a sentence), and then I somehow need to identify the sentence that index occurs in.

I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but doing so renders the index locations useless.

How should I go about solving this problem? I suppose regexes are an option, but the NLTK tokenizer seems such a nice way of doing it that it would be a shame not to use it. Or should I somehow rebase the start index by counting the characters since the previous full stop/exclamation mark/question mark?

You are right: the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit almost all sentences correctly, including sentences that end with a "quotation." You can do something like this (the paragraph comes from a random text generator):

Start with:

from nltk.tokenize import sent_tokenize

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []

Most intuitive way:

for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break

But using this method we effectively have a triply nested loop: we check each sentence, then each highlight, and then each subsequence of the sentence for the highlight (that last scan is hidden inside Python's "in" check).

We can get better performance since we know the start index for each highlight:

highlightIndices = [100, 169]
searchFrom = 0
for sentence in sent_tokenize(paragraph):
    # Re-anchor each sentence in the original paragraph, since
    # sent_tokenize strips the whitespace between sentences.
    start = paragraph.index(sentence, searchFrom)
    end = start + len(sentence)
    for index in highlightIndices:
        if start <= index < end:
            sentencesWithHighlights.append(sentence)
            break
    searchFrom = end

In either case we get:

sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
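
If you would rather avoid the offset bookkeeping entirely, NLTK's underlying Punkt tokenizer can report character spans directly. A minimal sketch, reusing paragraph and the highlight indexes from above (it assumes the pretrained punkt model has been downloaded via nltk.download('punkt')):

import nltk

# Load the pretrained Punkt model that sent_tokenize uses internally.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

highlightIndices = [100, 169]
sentencesWithHighlights = []

# span_tokenize yields (start, end) character offsets into the original
# string, so the highlight indexes can be compared against them directly.
for start, end in tokenizer.span_tokenize(paragraph):
    if any(start <= index < end for index in highlightIndices):
        sentencesWithHighlights.append(paragraph[start:end])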

I assume that all your sentences end with one of these three characters: !?.

What about looping over the list of highlights and creating a regexp group:

(?:list|of|your highlights)

Then matching your whole abstract against this regexp:

/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig

This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (see RegExr).
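
Since the question is about Python, here is a minimal sketch of the same idea with the re module; the pattern is a direct translation of the regexp literal above, and note that the capture group excludes the trailing punctuation:

import re

abstract = ("Chickens comprises coffee. Chickens crushes a popular vet "
            "next to the eater. Coffee funds chickens.")
highlights = ["vet", "funds"]

# Build the alternation group, escaping any regex metacharacters that
# the highlight strings might contain.
group = "|".join(re.escape(h) for h in highlights)
pattern = r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])"

matches = [m.group(1) for m in re.finditer(pattern, abstract, re.IGNORECASE)]
# matches == ['Chickens crushes a popular vet next to the eater',
#             'Coffee funds chickens']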

Another option (although it is hard to say how reliably this handles variably formatted text) is to split the text into a list of sentences and test against those:

re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
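
A sketch of how that split could be combined with the highlight offsets; sentence_at is a hypothetical helper, and the pattern assumes each sentence starts with a capital letter:

import re

def sentence_at(text, index):
    """Return the sentence of text containing the given character index."""
    offset = 0
    for sentence in re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text):
        # Re-anchor in the original text, since the split consumed the
        # whitespace between sentences.
        start = text.index(sentence, offset)
        end = start + len(sentence)
        if start <= index < end:
            return sentence
        offset = end
    return None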
