
How to extract sentences containing a particular word from millions of paragraphs

I scraped millions of newspaper articles using Python Scrapy. Now, I want to extract the sentences that contain a given word. Below is my implementation.

import collections
import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = collections.defaultdict(list)  # word -> sentences containing it
for a in articles:  # articles and words are defined earlier
    article_sentence = tokenizer.tokenize(a)  # split the article into sentences
    for s in article_sentence:
        for w in words:
            if ' ' + w + ' ' in s:  # crude whole-word check
                sentences[w].append(s)

I have around 1000 words. The above code is not efficient and takes a lot of time. Also, a sentence can contain the root word in a different form (e.g., past tense). How can I extract the sentences efficiently? Please help. Are there any other tools I need?

This sounds like a perfect application for the Aho-Corasick string-matching algorithm. It searches a single text (e.g., your tokenized sentence or document) for multiple strings simultaneously. That simultaneous search will eliminate the inner loop in your initial implementation (including the expensive string concatenation in that loop).

I've only implemented Aho-Corasick in Java, but a quick Google search yields links to several existing Python implementations, e.g., ahocorasick and pyahocorasick.

I have no experience with either implementation (or any of the other options), but you can probably find one that meets your needs - or implement it yourself if you feel like an enjoyable bit of coding.

My recommendation would be to include all the word forms of interest in your 'dictionary' trie (the set of matches to search for). E.g., if you're searching for 'write', insert both 'write' and 'wrote' into the trie. That will reduce the amount of preprocessing you'll need to do on input documents.

I'd also recommend searching texts as large as practical (perhaps a paragraph or a full document at a time, instead of one sentence at a time), to make more efficient use of each Aho-Corasick invocation.
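To make this concrete, here is a minimal sketch using the pyahocorasick package (installable with pip install pyahocorasick). The forms dictionary and the sample article are purely illustrative; in your case the entries would come from your ~1000 words and their variants, and articles/tokenizer from your existing pipeline.

import collections
import ahocorasick
import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Map every surface form back to its root word (illustrative entries only).
forms = {
    'write': ['write', 'wrote', 'written'],
    'speak': ['speak', 'spoke', 'spoken'],
}

automaton = ahocorasick.Automaton()
for root, variants in forms.items():
    for variant in variants:
        automaton.add_word(variant, (root, variant))
automaton.make_automaton()  # finalize the trie before searching

articles = ["He wrote a letter yesterday. Nobody spoke about it."]  # sample input
sentences = collections.defaultdict(list)
for a in articles:
    for s in tokenizer.tokenize(a):
        # One pass over the sentence matches all dictionary entries at once.
        # Note: Aho-Corasick matches substrings, so 'write' also hits 'writer';
        # add a word-boundary check if you need whole words only.
        hits = {root for _end, (root, _variant) in automaton.iter(s.lower())}
        for root in hits:
            sentences[root].append(s)

print(dict(sentences))

Because the automaton is built once and reused, the per-sentence search cost no longer grows with the number of search words.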

Could you post a snippet of an article you would like to parse and the words you are looking for?

Based on what you need, I would suggest using something like this:

import re
...
...
word_set = set(words)  # build the set once, outside the loop
for s in article_sentence:
    # split on whitespace and punctuation (adjust the delimiters as needed)
    sentence_words = set(re.split(r'[\s.;,!?]+', s))
    for w in word_set & sentence_words:  # intersection: search words present in s
        sentences[w].append(s)

Reference for usage of set: https://docs.python.org/2/library/stdtypes.html#set
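For instance, with a couple of hypothetical inputs, the intersection picks out matching sentences directly. Note that, unlike the Aho-Corasick approach above, this only finds exact token matches, so searching for 'write' will not match 'wrote'.

import re
import collections

words = ['letter', 'book']  # hypothetical search terms
article_sentence = ['He wrote a letter.', 'She reads books.']

sentences = collections.defaultdict(list)
word_set = set(words)
for s in article_sentence:
    sentence_words = set(re.split(r'[\s.;,!?]+', s.lower()))
    for w in word_set & sentence_words:
        sentences[w].append(s)

print(dict(sentences))  # {'letter': ['He wrote a letter.']}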
