
Scraping PDF newspapers for keywords

I have a couple of hundred newspapers in PDF format and a list of keywords. My ultimate goal is to count the number of articles mentioning a specific keyword, keeping in mind that one PDF might contain multiple articles mentioning the same keyword.

My problem is that when I converted the PDF files to plain text, I lost the formatting, which makes it impossible to tell where an article starts and where it ends.

What is the best way to approach this problem? Right now I'm thinking it is impossible.

I am currently using Python for this project with the pdfminer library. Here is one of the PDFs: http://www.gulf-times.com/PDFLinks/streams/2011/2/27/2_418617_1_255.02.11.pdf

Depending on the format of the text, you might be able to come up with some sort of heuristic which identifies a headline - say, a line on its own with fewer than 15 words and no full stop/period character. This will get confused by things like the name of the newspaper, but hopefully those won't have significant amounts of "non-headline" text after them to mess up the results too badly.
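As a rough illustration, that headline test might look like the function below - a sketch only, where the 15-word cutoff is an arbitrary threshold you'd want to tune against your actual files:

def is_headline(line):
    # Assume a headline is a short standalone line with no full
    # stop; the 15-word cutoff is a guess, not a rule.
    words = line.split()
    return 0 < len(words) < 15 and "." not in line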

This relies on the conversion to text having kept every article contiguous (as opposed to just ripping out raw columns and mixing articles together). If they're mixed up, I'd say you have very little chance - even if you can find a PDF library which maintains formatting, it's not necessarily easy to tell what constitutes an article's "bounding box". For example, many papers put in callouts and other features which could confuse even quite an advanced heuristic.

Actually doing the counting is simple enough. If the assumptions I've mentioned hold, your code would likely end up looking something like this:

import re
import string

non_word_re = re.compile(r"[^-\w']+")

article = ""
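# list_of_text_files is assumed to be a list of paths to the
# already-converted plain text files.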
for filename in list_of_text_files:
    with open(filename, "r") as fd:
        for line in fd:
            # Split line on non-word characters and lowercase them for matching.
            words = [i.lower() for i in non_word_re.split(line)
                     if i and i[0] in string.ascii_letters]
            if not words:
                continue
            # Check for headline as the start of a new article.
            if len(words) < 15 and "." not in line:
                if article:
                    # Process previous article
                    handle_article_word_counts(article, counts)
                article = line.strip()
                counts = {}
                continue
            # Only process body text within an article.
            if article:
                for word in words:
                    counts[word] = counts.get(word, 0) + 1
    if article:
        # Flush the final article of this file, then reset so text
        # from the next file isn't appended to it.
        handle_article_word_counts(article, counts)
    article = ""

You'll need to define handle_article_word_counts() to do whatever indexing of the data you want, but each key in counts will be a potential keyword (including common words like "and" and "the", so you may want to drop the most frequent words or something like that).
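For instance, a minimal handle_article_word_counts() might just tally how many articles mention each of your keywords at least once - a sketch, where keywords and keyword_totals are hypothetical names for whatever structures you actually use:

keywords = ["oil", "qatar", "economy"]   # hypothetical example terms, lowercased
keyword_totals = {}

def handle_article_word_counts(article, counts):
    # Count each article once per keyword, however many times the
    # keyword occurs within that article.
    for keyword in keywords:
        if counts.get(keyword, 0) > 0:
            keyword_totals[keyword] = keyword_totals.get(keyword, 0) + 1

After the loop over files finishes, keyword_totals would then hold the per-keyword article counts you're after.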

Basically it depends on how accurate you want the results to be. I think the above has some chance of giving you a fair approximation, but it has all the assumptions and caveats I've already mentioned - for example, if it turns out that headlines can span lines, you'll need to modify the heuristics above. Hopefully it'll give you something to build on, at least.
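For the multi-line headline case specifically, one possible adjustment - a sketch, assuming consecutive headline-like lines always belong to the same headline - is to merge such runs before feeding lines into the main loop, reusing the is_headline() sketch from above:

def merge_headlines(lines):
    # Join runs of consecutive headline-like lines into a single
    # headline. Note this will wrongly merge two separate one-line
    # headlines that happen to be adjacent.
    merged = []
    buffer = []
    for line in lines:
        if is_headline(line):
            buffer.append(line.strip())
        else:
            if buffer:
                merged.append(" ".join(buffer))
                buffer = []
            merged.append(line)
    if buffer:
        merged.append(" ".join(buffer))
    return merged

You'd then iterate over merge_headlines(fd) instead of fd in the inner loop.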
