Parsing scientific publication page ranges in Python

Question

I need to parse a set of strings that contain page ranges as they appear in the metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I'm not even sure one exists, but examples of the strings I need to process are:

6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII

Ideally, I'd like to return the number of pages for each string, e.g. 9 for "6-10, 19-22". If that's too hard, I'd at least like to know whether it's a single page or more. The latter is actually pretty easy, since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I would much prefer to get the right count.
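For the single-page check, a minimal sketch might look like the following. It assumes commas and dashes really are the only delimiters, and the name is_multi_page is just for illustration:

```python
def is_multi_page(s):
    """Return True if the string appears to cover more than one page."""
    # Assumption: a comma or a dash always signals more than one page.
    return ',' in s or '-' in s
```

This would report True for "6-10, 19-22" and False for "326", but it says nothing about how many pages there are.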

I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.

Answer 1

Here's a solution that supports parsing "normal" numbers as well as Roman numerals. For parsing Roman numerals, install the roman package (pip install roman). It does not handle strings such as 111S-2S or A078-132, for which int() raises a ValueError, but you can enhance the parse_num function to support additional formats.

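import roman

def parse_num(p):
    """Parse one page number, trying Roman numerals first, then decimal."""
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except roman.InvalidRomanNumeralError:
        # Not a Roman numeral; fall back to a plain integer.
        return int(p)

def parse_pages(s):
    """Return the total number of pages in a comma-separated list of ranges."""
    count = 0
    for part in s.split(','):
        # A single page has no dash, so rng[0] and rng[-1] are the same item.
        rng = part.split('-', 1)
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count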

>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
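To cover the remaining examples from the question (111S-2S and A078-132), one possible extension is sketched below. It rests on two assumptions that the question does not confirm: letters attached to digits (the S suffix, the A prefix) are labels that can be ignored for counting, and an end page smaller than the start page is an abbreviation (111-2 meaning 111-112). To keep the sketch self-contained it uses a small, lenient Roman numeral converter instead of the roman package. The names roman_to_int, parse_page, expand_end and count_pages are made up for this example.

```python
import re

# Values of the Roman numeral symbols.
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(s):
    """Convert a Roman numeral to an int (lenient: no validity checking)."""
    total = 0
    prev = 0
    # Walk right to left; a symbol smaller than its right neighbour is subtracted.
    for ch in reversed(s.upper()):
        value = ROMAN_VALUES[ch]
        total += -value if value < prev else value
        prev = value
    return total

def parse_page(token):
    """Parse a single page label such as '19', '111S', 'A078' or 'xlvii'."""
    token = token.strip()
    digits = re.search(r'\d+', token)
    if digits:
        # Assumption: letters around the digits are labels, not part of the number.
        return int(digits.group())
    if token and all(ch in ROMAN_VALUES for ch in token.upper()):
        return roman_to_int(token)
    raise ValueError('unrecognized page number: %r' % token)

def expand_end(start, end):
    """Expand an abbreviated end page, e.g. start=111, end=2 gives 112."""
    if end >= start:
        return end
    scale = 10 ** len(str(end))
    # Replace the trailing digits of start with the digits of end.
    expanded = start - start % scale + end
    if expanded < start:
        # Assumption: the range rolls over into the next block, e.g. 199-2 -> 202.
        expanded += scale
    return expanded

def count_pages(s):
    """Return the total number of pages in a comma-separated list of ranges."""
    total = 0
    for part in s.split(','):
        first, _, last = part.partition('-')
        start = parse_page(first)
        # No text after the dash (or no dash at all) means a single page.
        end = expand_end(start, parse_page(last)) if last.strip() else start
        total += end - start + 1
    return total
```

Under these assumptions, count_pages gives 9 for '6-10, 19-22', 2 for 'xlvii-xlviii', 2 for '111S-2S', 1 for '326', 55 for 'A078-132' and 14 for 'XC-CIII'. Because digits are checked first, a purely alphabetic label made of Roman symbols (such as 'C') is always read as a numeral, which may not be right for every publisher.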
