I need to parse a set of strings that contain page ranges as they appear in metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure if one exists, but examples of strings I need to process are:
6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII
Ideally, I'd like to return the number of pages for each string, e.g. 9 for 6-10, 19-22. If that's too hard, at least whether it's a single page or more. The latter is actually pretty easy, since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I would very much prefer to get the right count.
I can write my own parser but I am curious whether there are any existing packages that can already do this out of the box or with minimal mods.
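Here's a solution that parses ordinary (Arabic) numbers as well as Roman numerals. Roman numerals are handled by the roman package (pip install roman, or easy_install roman on older setups). Pages with letter prefixes or suffixes, such as 111S-2S and A078-132, are not handled as written; you can extend the parse_num function to support such additional formats.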
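import roman

def parse_num(p):
    # Try a Roman numeral first; fall back to a plain integer.
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except roman.InvalidRomanNumeralError:
        return int(p)

def parse_pages(s):
    count = 0
    # Commas separate ranges; a dash separates the first and last page.
    for part in s.split(','):
        rng = part.split('-', 1)
        # For a single page, rng has one element, so rng[0] is rng[-1].
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count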
>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
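As a rough sketch of how parse_num could be extended, the version below also accepts pages with letter prefixes or suffixes (A078, 111S) and abbreviated end pages. It rests on two assumptions that the question does not confirm: that letters around the digits can simply be ignored, and that a shorter end page such as the 2S in 111S-2S replaces the trailing digits of the start page (so the range is 111 to 112). The names roman_to_int, parse_page and count_pages are made up for this example, and the Roman numeral conversion is a lenient stand-in written with the standard library only, so that the snippet runs without the roman package.

```python
import re

_ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
_ROMAN_RE = re.compile(r'^[IVXLCDM]+$')
# Optional letters, then digits, then optional letters (e.g. A078, 111S).
_AFFIXED_RE = re.compile(r'^[A-Za-z]*(\d+)[A-Za-z]*$')

def roman_to_int(s):
    # Lenient conversion: does not reject non-canonical numerals like IIII.
    total, prev = 0, 0
    for ch in reversed(s.upper()):
        value = _ROMAN_VALUES[ch]
        # A smaller value to the left of a larger one is subtracted (IV, XC).
        total += -value if value < prev else value
        prev = value
    return total

def parse_page(token):
    # Returns (page_number, digit_string); digit_string is None for Roman numerals.
    token = token.strip()
    if _ROMAN_RE.match(token.upper()):
        return roman_to_int(token), None
    m = _AFFIXED_RE.match(token)
    if not m:
        raise ValueError('unrecognised page: %r' % token)
    return int(m.group(1)), m.group(1)

def count_pages(s):
    total = 0
    for part in s.split(','):
        first, _, last = part.partition('-')
        start, start_digits = parse_page(first)
        if not last.strip():
            total += 1  # single page
            continue
        end, end_digits = parse_page(last)
        # Assumption: a shorter, smaller end page abbreviates the start page,
        # so 111-2 is read as 111-112.
        if (start_digits and end_digits
                and len(end_digits) < len(start_digits) and end < start):
            end = int(start_digits[:-len(end_digits)] + end_digits)
        if end < start:
            raise ValueError('end page before start page: %r' % part)
        total += end - start + 1
    return total
```

Under these assumptions, count_pages('111S-2S') gives 2 and count_pages('A078-132') gives 55, while the earlier examples keep their results. Strings that fit neither pattern raise ValueError rather than being guessed at.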