Parsing scientific publication page ranges in Python

I need to parse a set of strings containing page ranges as they appear in the metadata of scientific and other publications. I don't have a complete spec of the pagination format, and I am not even sure one exists, but examples of the strings I need to process are:

6-10, 19-22
xlvii-xlviii
111S-2S
326
A078-132
XC-CIII

Ideally, I'd like to return the number of pages for each string, e.g. 9 for "6-10, 19-22". If that's too hard, I'd at least like to know whether it's a single page or more. The latter is actually pretty easy, since commas and dashes seem to be the only delimiters in the examples I've seen so far. But I would much prefer to get the right count.

I can write my own parser, but I am curious whether any existing package can already do this out of the box or with minimal modification.

Here's a solution that supports parsing "normal" numbers as well as Roman numerals. For parsing Roman numerals, install the roman package (pip install roman). You can enhance the parse_num function to support additional formats.

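import roman

def parse_num(p):
    # Try Roman numerals first, then fall back to a plain integer.
    p = p.strip()
    try:
        return roman.fromRoman(p.upper())
    except roman.InvalidRomanNumeralError:
        return int(p)

def parse_pages(s):
    # Sum the lengths of all comma-separated ranges in s.
    count = 0
    for part in s.split(','):
        # A single page like '17' yields one element, so a == b.
        rng = part.split('-', 1)
        a, b = parse_num(rng[0]), parse_num(rng[-1])
        count += b - a + 1
    return count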

>>> parse_pages('17')
1
>>> parse_pages('6-10, 19-22')
9
>>> parse_pages('xlvii-xlviii')
2
>>> parse_pages('XC-CIII')
14
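
The two remaining examples from the question, 111S-2S and A078-132, are not handled by parse_num as written. Below is a minimal sketch of one way to extend the idea, using only the standard library. It rests on assumptions that the question does not confirm: that letters attached to a number (a prefix like "A" or a suffix like "S") can be ignored for counting, and that a shorter end number such as the "2" in 111S-2S abbreviates the last digits of the start page (so the range is 111 to 112). The names roman_to_int, parse_token and count_pages are made up for this illustration, and roman_to_int is a lenient converter that does not reject malformed numerals.

```python
import re

ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
ROMAN_RE = re.compile(r'[IVXLCDM]+', re.IGNORECASE)
# Optional letters, digits, optional letters, e.g. 'A078', '111S', '326'.
PAGE_RE = re.compile(r'([A-Za-z]*)(\d+)([A-Za-z]*)')

def roman_to_int(s):
    # Lenient conversion: scan right to left, subtracting a symbol
    # when it is smaller than the one to its right.
    total, prev = 0, 0
    for ch in reversed(s.upper()):
        val = ROMAN_VALUES[ch]
        total += -val if val < prev else val
        prev = val
    return total

def parse_token(tok):
    # Return (digit_string_or_None, integer_value) for one page label.
    tok = tok.strip()
    m = PAGE_RE.fullmatch(tok)
    if m:
        # Assumption: prefix/suffix letters do not affect the count.
        return m.group(2), int(m.group(2))
    if ROMAN_RE.fullmatch(tok):
        return None, roman_to_int(tok)
    raise ValueError('unrecognized page label: %r' % tok)

def count_pages(s):
    total = 0
    for part in s.split(','):
        first, _, last = part.partition('-')
        a_digits, a = parse_token(first)
        if not last.strip():
            total += 1  # single page, no range
            continue
        b_digits, b = parse_token(last)
        # Assumption: '111-2' abbreviates '111-112', so borrow the
        # leading digits of the start page when the end is shorter.
        if a_digits and b_digits and b < a and len(b_digits) < len(a_digits):
            b = int(a_digits[:-len(b_digits)] + b_digits)
        if b < a:
            raise ValueError('descending range: %r' % part.strip())
        total += b - a + 1
    return total
```

Under those assumptions, count_pages('111S-2S') gives 2 and count_pages('A078-132') gives 55, while the other examples give the same counts as parse_pages. Whether those interpretations match your data is something to check against the actual publications.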
