Parsing text from a table of contents using regex
Here is the text I want to parse, stored in a variable named "toc":
Table of Contents
I. INTRODUCTION .................................... 1
II. FACTUAL ASPECTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
A. The Clean Air Act . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B. EPA's Gasoline Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1. Establishment of Baselines . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Reformulated Gasoline . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Conventional Gasoline (or "Anti-Dumping Rules") . . . . . . . . 4
C. The May 1994 Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
III. MAIN ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
A. General .................................... 5
B. The General Agreement on Tariffs and Trade . . . . . . . . . . . . . . . . 6
1. Article I - General Most-Favoured-Nation Treatment . . . . . . . 6
2. Article III - National Treatment on Internal Taxation
and Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
a) Article III:4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
b) Article III:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3. Article XX - General Exceptions . . . . . . . . . . . . . . . . . . . . 15
4. Article XX(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
a) "Protection of Human, Animal and Plant Life
or Health" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
b) "Necessary" . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. Article XX(d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6. Article XX(g) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
a) "Related to the conservation of exhaustible natural
resources..." . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
b) "... made effective in conjunction with restrictions
on domestic production or consumption" . . . . . . . . . . 23
7. Preamble to Article XX . . . . . . . . . . . . . . . . . . . . . . . . . . 23
8. Article XXIII - Nullification and Impairment . . . . . . . . . . . . 25
I want a result like this:
['I.INTRODUCTION ...... 1', 'A. The Clean Air Act ....3', 'B. EPA\'s Gasoline Rule ... 3', (AND_SO_ON) ]
INPUT:
re.search(r"((?<=(\n))\s+(?P<name>[A-Z \.]*?)(\n))", toc_s).group()
OUTPUT:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-64-4aa240f6e378> in <module>()
----> 1 re.search(r"((?<=(\n))\s+(?P<name>[A-Z \.]*?)(\n))", toc_s).group()
AttributeError: 'NoneType' object has no attribute 'group'
What am I doing wrong?
Assuming the entire TOC is held in a multi-line string named text, you can use re.findall or re.finditer with the re.MULTILINE flag enabled.
import re

for match in re.finditer(r'(.*?)[\W]+(\d+)(?=\n|$)', text, flags=re.M):
    chapter, page = map(str.strip, match.groups())
    ...  # do something with these
Or,
contents = re.findall(r'(.*?)[\W]+(\d+)(?=\n|$)', text, flags=re.M)
which returns something along these lines:
[('I. INTRODUCTION', '1'),
('II. FACTUAL ASPECTS', '2'),
(' A. The Clean Air Act', '3'),
(" B. EPA's Gasoline Rule", '3'),
(' 1. Establishment of Baselines', '3'),
(' 2. Reformulated Gasoline', '4'),
...
]
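The result is a list of 2-tuples. Each tuple holds (a) the chapter title and (b) the corresponding page number. A line that does not match the pattern is ignored.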
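As a quick self-contained check, here is a minimal sketch that runs the same pattern on a short sample and then rebuilds "title ... page" strings close to the list asked for in the question. The sample text, the names PATTERN, entries and lines, and the " ... " separator are assumptions made for illustration; they are not part of the original post.

```python
import re

# Hypothetical sample: a few lines in the same shape as the TOC above,
# including one entry whose title wraps onto a second line.
text = """Table of Contents
I. INTRODUCTION .................................... 1
II. FACTUAL ASPECTS . . . . . . . . . . 2
   A. The Clean Air Act . . . . . . . . 3
   B. EPA's Gasoline Rule . . . . . . . 3
      2. Article III - National Treatment on Internal Taxation
         and Regulation . . . . . . . . 7
"""

# Same pattern as in the answer, written as a raw string.
PATTERN = r'(.*?)[\W]+(\d+)(?=\n|$)'

# findall returns (chapter, page) tuples; strip the leading indentation.
entries = [
    (chapter.strip(), page)
    for chapter, page in re.findall(PATTERN, text, flags=re.M)
]

# Rebuild flat strings similar to the list requested in the question.
lines = [f"{chapter} ... {page}" for chapter, page in entries]

for line in lines:
    print(line)
```

Note that for the wrapped entry only the continuation line ends in a page number, so only that line matches and the title comes back as "and Regulation". If full titles matter, wrapped lines would need to be joined before matching.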
Details
The pattern is quite specific and took some trial and error to get right.
(          # first capture group: the chapter title
    .*?    # non-greedy match of any characters except newline
)
[\W]+      # one or more non-word characters (the dot leaders and spaces)
(          # second capture group: the page number
    \d+    # one or more digits
)
(?=        # lookahead for a newline or end of line (multiline mode)
    \n     # literal newline
    |      # regex OR
    $      # end of line
)
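The annotated breakdown can be compiled more or less as written by adding the re.VERBOSE flag, which makes the regex engine ignore whitespace and # comments inside the pattern. The sketch below does that and also shows one side effect of [\W]+: it swallows a closing parenthesis or quote at the end of a title. The names TOC_ENTRY, LEADER_ENTRY and sample are hypothetical, and LEADER_ENTRY is only one possible alternative (it assumes every entry has at least two leader dots); it is not part of the original answer.

```python
import re

# The annotated pattern, compiled with re.VERBOSE so the comments can stay.
TOC_ENTRY = re.compile(
    r"""
    (           # first capture group: the chapter title
        .*?     # non-greedy match
    )
    [\W]+       # one or more non-word characters (dot leaders, spaces)
    (           # second capture group: the page number
        \d+     # one or more digits
    )
    (?=         # lookahead: newline or end of line
        \n
        |
        $
    )
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

# Hypothetical alternative: match the dot leader explicitly, so trailing
# punctuation such as ')' or '"' stays part of the title.
LEADER_ENTRY = re.compile(
    r'^[ \t]*(.+?)[ \t]*(?:\.[ \t]*){2,}(\d+)[ \t]*$',
    flags=re.MULTILINE,
)

sample = (
    'I. INTRODUCTION ..... 1\n'
    '4. Article XX(b) . . . . 15\n'
    'b) "Necessary" . . . 15'
)

original = TOC_ENTRY.findall(sample)
stricter = LEADER_ENTRY.findall(sample)

for (a, page), (b, _) in zip(original, stricter):
    print(repr(a), repr(b), page)
```

With the original pattern the second and third titles lose their final ")" and closing quote, because those characters are non-word characters and the non-greedy title group stops as early as it can. The alternative keeps them, at the cost of requiring a recognisable dot leader on every line.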