I'm working on a project that involves creating a database of the US federal code in a certain format. I've obtained the whole code from the official source, which is not well structured. Using some code from GitHub, I have managed to scrape the US Code into text files in the format below.
-CITE-
13 USC Sec. 1 1/15/2013
-EXPCITE-
TITLE 13 - CENSUS
CHAPTER 1 - ADMINISTRATION
SUBCHAPTER I - GENERAL PROVISIONS
-HEAD-
Sec. 1. Definitions
-STATUTE-
As used in this title, unless the context requires another
meaning or unless it is otherwise provided -
(1) "Bureau" means the Bureau of the Census;
(2) "Secretary" means the Secretary of Commerce; and
(3) "respondent" includes a corporation, company, association,
firm, partnership, proprietorship, society, joint stock company,
individual, or other organization or entity which reported
information, or on behalf of which information was reported, in
response to a questionnaire, inquiry, or other request of the
Bureau.
-SOURCE-
(Aug. 31, 1954, ch. 1158, 68 Stat. 1012; Pub. L. 94-521, Sec. 1,
Oct. 17, 1976, 90 Stat. 2459.)
-MISC1-
<some text>
-End-
-CITE-
13 USC Sec. 2 1/15/2013
-EXPCITE-
TITLE 13 - CENSUS
CHAPTER 1 - ADMINISTRATION
SUBCHAPTER I - GENERAL PROVISIONS
-HEAD-
Sec. 2. Bureau of the Census
-STATUTE-
The Bureau is continued as an agency within, and under the
jurisdiction of, the Department of Commerce.
-SOURCE-
(Aug. 31, 1954, ch. 1158, 68 Stat. 1012.)
-MISC1-
<some text>
-End-
Each text file contains thousands of such blocks, each starting with a -CITE- tag and ending with an -End- tag.
Apart from these, there are certain blocks that represent the start of a chapter or subchapter, and these do not contain a -STATUTE- tag.
E.g.:
-CITE-
13 USC CHAPTER 3 - COLLECTION AND PUBLICATION OF
STATISTICS 1/15/2013
-EXPCITE-
TITLE 13 - CENSUS
CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS
-HEAD-
CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS
-MISC1-
SUBCHAPTER I - COTTON
Sec.
41. Collection and publication.
42. Contents of reports; number of bales of linter;
distribution; publication by Department of
Agriculture.
43. Records and reports of cotton ginners.
SUBCHAPTER II - OILSEEDS, NUTS, AND KERNELS; FATS, OILS, AND
GREASES
61. Collection and publication.
62. Additional statistics.
63. Duplicate collection of statistics prohibited; access
to available statistics.
SUBCHAPTER III - APPAREL AND TEXTILES
81. Statistics on apparel and textile industries.
SUBCHAPTER IV - QUARTERLY FINANCIAL STATISTICS
91. Collection and publication.
SUBCHAPTER V - MISCELLANEOUS
101. Defective, dependent, and delinquent classes; crime.
102. Religion.
103. Designation of reports.
AMENDMENTS
<some text>
-End-
I am interested only in those blocks that have a -STATUTE- tag.
Is there a way to extract only the blocks of text that have the -STATUTE- tag and write them to another text file?
I'm new to Python but I'm told this can be easily done in Python.
I'd appreciate it if someone could guide me with this.
So, for each line, if it starts with a hyphen, followed by some upper-case text, followed by another hyphen, then it's a marker that notes that we're in a new section of some sort. This can be done using a regular expression:
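import re

current_section_type = None
# A marker line looks like -CITE-, -STATUTE-, -MISC1- or -End-,
# so digits and lower-case letters are allowed as well.
r = re.compile("^-([A-Za-z0-9]+)-")
# "uscode.txt" is a placeholder for one of your scraped text files
with open("uscode.txt") as f:
    for line in f:
        m = r.match(line)
        if m:
            current_section_type = m.group(1)
        elif current_section_type == "STATUTE":
            print(line.strip())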
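The loop above prints only the lines of each -STATUTE- section. If the whole block from -CITE- to -End- is wanted, as the question asks, one option is to buffer each block and keep it only when a -STATUTE- marker was seen. This is a minimal sketch, not a tested solution for the full US Code files: the function names statute_blocks and copy_statute_blocks and the file paths are made up for illustration, and it assumes every marker sits alone on its own line.

```python
import re

# A marker is a whole line such as -CITE-, -STATUTE-, -MISC1- or -End-
MARKER = re.compile(r"^-([A-Za-z0-9]+)-\s*$")

def statute_blocks(lines):
    """Yield each -CITE- ... -End- block (as one string) that contains -STATUTE-."""
    block = []
    has_statute = False
    for line in lines:
        m = MARKER.match(line)
        tag = m.group(1).upper() if m else None
        if tag == "CITE":
            # A new block starts: drop anything buffered so far
            block = []
            has_statute = False
        block.append(line)
        if tag == "STATUTE":
            has_statute = True
        elif tag == "END":
            if has_statute:
                yield "".join(block)
            block = []
            has_statute = False

def copy_statute_blocks(src_path, dst_path):
    """Write only the blocks containing -STATUTE- from src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(statute_blocks(src))
```

Because the input is read line by line, only one block is held in memory at a time. A call such as copy_statute_blocks("title13.txt", "title13_statutes.txt") (both names hypothetical) would write the filtered blocks to the second file.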
I'd read the text line by line and parse it myself. This way you can handle large inputs as a stream. There are nicer-looking solutions using multiline regexps, but those cannot process the input as a stream because they need the whole text in memory at once.
#!/usr/bin/env python
import sys, re

# states for our state machine:
OUTSIDE = 0               # not inside a -CITE- ... -End- block
INSIDE = 1                # inside a block, no -STATUTE- seen yet
INSIDE_AFTER_STATUTE = 2  # inside a block that contains -STATUTE-

def eachCite(stream):
    """Yield (withStatute, blockText) for every -CITE- ... -End- block."""
    state = OUTSIDE
    capture = ''
    for lineNumber, line in enumerate(stream):
        if state in (INSIDE, INSIDE_AFTER_STATUTE):
            capture += line
        if re.match('^-CITE-', line):
            if state == OUTSIDE:
                state = INSIDE
                capture = line
            elif state in (INSIDE, INSIDE_AFTER_STATUTE):
                raise Exception("-CITE- in -CITE-??", lineNumber)
            else:
                raise NotImplementedError(state)
        elif re.match('^-End-', line):
            if state == OUTSIDE:
                raise Exception("-End- without -CITE-??", lineNumber)
            elif state == INSIDE:
                yield False, capture
                state = OUTSIDE
            elif state == INSIDE_AFTER_STATUTE:
                yield True, capture
                state = OUTSIDE
            else:
                raise NotImplementedError(state)
        elif re.match('^-STATUTE-', line):
            if state == OUTSIDE:
                raise Exception("-STATUTE- without -CITE-??", lineNumber)
            elif state == INSIDE:
                state = INSIDE_AFTER_STATUTE
            elif state == INSIDE_AFTER_STATUTE:
                raise Exception("-STATUTE- after -STATUTE-??", lineNumber)
            else:
                raise NotImplementedError(state)
    if state != OUTSIDE:
        raise Exception("EOF in -CITE-??")

for withStatute, cite in eachCite(sys.stdin):
    if withStatute:
        print("found cite with statute:")
        print(cite)
If you want to process a file instead of sys.stdin, you can do it like this:
with open('myInputFileName') as myInputFile, \
        open('myOutputFileName', 'w') as myOutputFile:
    for withStatute, cite in eachCite(myInputFile):
        if withStatute:
            myOutputFile.write("found cite with statute:\n")
            myOutputFile.write(cite)
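For comparison, the multiline-regexp alternative mentioned at the start of this answer might look roughly like the sketch below. It is shorter, but it has to load the whole file into memory first, which is exactly the drawback described above. The name statute_blocks_regex is made up for illustration, and the patterns assume that -CITE-, -STATUTE- and -End- each stand alone on a line.

```python
import re

# One block: from a -CITE- line up to and including the next -End- line.
# DOTALL lets "." cross line breaks; MULTILINE makes ^ and $ match per line.
BLOCK = re.compile(r"^-CITE-[ \t]*\n.*?^-End-[ \t]*$\n?",
                   re.DOTALL | re.MULTILINE)
STATUTE = re.compile(r"^-STATUTE-[ \t]*$", re.MULTILINE)

def statute_blocks_regex(text):
    """Return all -CITE- ... -End- blocks in text that contain -STATUTE-."""
    return [block for block in BLOCK.findall(text) if STATUTE.search(block)]
```

With the file contents read via something like open('myInputFileName').read(), the returned list can be written out with writelines. Unlike the state machine, this version silently skips malformed input (for example a -CITE- with no -End-) instead of raising an error.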