
Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.

I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:

{
  abstract: "...long abstract here...",
  highlights: [
    {
      concept: 'a word',
      start: 1,
      end: 10
    },
    {
      concept: 'cancer',
      start: 123,
      end: 135
    }
  ]
}

I am looping over each highlight and locating its start index in the abstract (the end doesn't really matter, as I only need a location within a sentence), and then I somehow need to identify the sentence that index occurs in.

I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but doing so renders the index locations useless.

How should I go about solving this problem? I suppose regexes are an option, but the NLTK tokenizer seems such a nice way of doing it that it would be a shame not to use it. Or should I somehow rebase the start index by counting the characters since the previous full stop/exclamation mark/question mark?

You are right: the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit almost all sentences correctly, including sentences that end with a "quotation." You can do something like this (the paragraph comes from a random text generator):

Start with:

from nltk.tokenize import sent_tokenize

paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []

Most intuitive way:

for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break

But using this method we effectively have a triply nested loop: we check each sentence, then each highlight, and then each subsequence of the sentence for the highlight (that last scan is hidden inside Python's "in" check).

We can get better performance since we know the start index for each highlight:

highlightIndices = [100, 169]
searchFrom = 0
for sentence in sent_tokenize(paragraph):
    # Re-anchor each sentence in the original paragraph, since
    # sent_tokenize strips the whitespace between sentences.
    start = paragraph.index(sentence, searchFrom)
    end = start + len(sentence)
    for index in highlightIndices:
        if start <= index < end:
            sentencesWithHighlights.append(sentence)
            break
    searchFrom = end

In either case we get:

sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
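
If you would rather avoid the offset bookkeeping entirely, NLTK's underlying Punkt tokenizer can report character spans directly. A minimal sketch, reusing paragraph and the highlight indexes from above (it assumes the pretrained punkt model has been downloaded via nltk.download('punkt')):

import nltk

# Load the pretrained Punkt model that sent_tokenize uses internally.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

highlightIndices = [100, 169]
sentencesWithHighlights = []

# span_tokenize yields (start, end) character offsets into the original
# string, so the highlight indexes can be compared against them directly.
for start, end in tokenizer.span_tokenize(paragraph):
    if any(start <= index < end for index in highlightIndices):
        sentencesWithHighlights.append(paragraph[start:end])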

I assume that all your sentences end with one of these three characters: !?.

What about looping over the list of highlights and creating a regexp group:

(?:list|of|your highlights)

Then matching your whole abstract against this regexp:

/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig

This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (see RegExr).
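
Since the question is about Python, here is a minimal sketch of the same idea with the re module; the pattern is a direct translation of the regexp literal above, and note that the capture group excludes the trailing punctuation:

import re

abstract = ("Chickens comprises coffee. Chickens crushes a popular vet "
            "next to the eater. Coffee funds chickens.")
highlights = ["vet", "funds"]

# Build the alternation group, escaping any regex metacharacters that
# the highlight strings might contain.
group = "|".join(re.escape(h) for h in highlights)
pattern = r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])"

matches = [m.group(1) for m in re.finditer(pattern, abstract, re.IGNORECASE)]
# matches == ['Chickens crushes a popular vet next to the eater',
#             'Coffee funds chickens']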

Another option (although it is hard to say how reliably this handles variably formatted text) is to split the text into a list of sentences and test against those:

re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
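
A sketch of how that split could be combined with the highlight offsets; sentence_at is a hypothetical helper, and the pattern assumes each sentence starts with a capital letter:

import re

def sentence_at(text, index):
    """Return the sentence of text containing the given character index."""
    offset = 0
    for sentence in re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text):
        # Re-anchor in the original text, since the split consumed the
        # whitespace between sentences.
        start = text.index(sentence, offset)
        end = start + len(sentence)
        if start <= index < end:
            return sentence
        offset = end
    return None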
