Extracting parts of text between specific delimiters from a large text file with custom delimiters and writing it to another file using Python

I'm working on a project that involves creating a database of the US federal code in a certain format. I've obtained the whole code from the official source, which is not well structured. I have managed to scrape the US Code into text files in the format below, using some code from GitHub.

-CITE-
    13 USC Sec. 1                                               1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 1 - ADMINISTRATION
    SUBCHAPTER I - GENERAL PROVISIONS

-HEAD-
    Sec. 1. Definitions

-STATUTE-
      As used in this title, unless the context requires another
    meaning or unless it is otherwise provided - 
        (1) "Bureau" means the Bureau of the Census;
        (2) "Secretary" means the Secretary of Commerce; and
        (3) "respondent" includes a corporation, company, association,
      firm, partnership, proprietorship, society, joint stock company,
      individual, or other organization or entity which reported
      information, or on behalf of which information was reported, in
      response to a questionnaire, inquiry, or other request of the
      Bureau.

-SOURCE-
    (Aug. 31, 1954, ch. 1158, 68 Stat. 1012; Pub. L. 94-521, Sec. 1,
    Oct. 17, 1976, 90 Stat. 2459.)


-MISC1-
                      <some text>

-End-


-CITE-
    13 USC Sec. 2                                               1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 1 - ADMINISTRATION
    SUBCHAPTER I - GENERAL PROVISIONS

-HEAD-
    Sec. 2. Bureau of the Census

-STATUTE-
      The Bureau is continued as an agency within, and under the
    jurisdiction of, the Department of Commerce.

-SOURCE-
    (Aug. 31, 1954, ch. 1158, 68 Stat. 1012.)


-MISC1-
                      <some text>

-End-

Each text file contains thousands of such blocks, each starting with a -CITE- tag and ending with an -End- tag.

Apart from these, there are blocks that mark the start of a chapter or subchapter, and these do not contain a -STATUTE- tag.

For example:

-CITE-
    13 USC CHAPTER 3 - COLLECTION AND PUBLICATION OF
           STATISTICS                                      1/15/2013

-EXPCITE-
    TITLE 13 - CENSUS
    CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS

-HEAD-
           CHAPTER 3 - COLLECTION AND PUBLICATION OF STATISTICS       


-MISC1-
                           SUBCHAPTER I - COTTON                       
    Sec.                                                     
    41.         Collection and publication.                           
    42.         Contents of reports; number of bales of linter;
                 distribution; publication by Department of
                 Agriculture.                                         
    43.         Records and reports of cotton ginners.                

       SUBCHAPTER II - OILSEEDS, NUTS, AND KERNELS; FATS, OILS, AND
                                  GREASES
    61.         Collection and publication.                           
    62.         Additional statistics.                                
    63.         Duplicate collection of statistics prohibited; access
                 to available statistics.                             

                   SUBCHAPTER III - APPAREL AND TEXTILES               
    81.         Statistics on apparel and textile industries.         

              SUBCHAPTER IV - QUARTERLY FINANCIAL STATISTICS          
    91.         Collection and publication.                           

                       SUBCHAPTER V - MISCELLANEOUS                   
    101.        Defective, dependent, and delinquent classes; crime.  
    102.        Religion.                                             
    103.        Designation of reports.                               

                                AMENDMENTS                            
      <some text>

-End-

I am interested only in those blocks that have a -STATUTE- tag.

Is there a way to extract only the blocks of text that have the -STATUTE- tag and write them to another text file?

I'm new to Python, but I'm told this can be done easily in Python.

I'd appreciate it if someone could guide me with this.

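So, for each line, if it starts with a hyphen, followed by some letters or digits, followed by another hyphen, then it's a marker that notes that we're in a new section of some sort. This can be detected with a regular expression, and the lines that follow a -STATUTE- marker can then be printed:

import re

current_section_type = None
# A marker line looks like -CITE-, -STATUTE-, -MISC1- or -End-.
marker = re.compile(r"^-([A-Za-z0-9]+)-")
with open("input.txt") as f:  # "input.txt" is a placeholder file name
  for line in f:  # iterate lazily instead of calling f.readlines()
    m = marker.match(line)
    if m:
      current_section_type = m.group(1)
    elif current_section_type == "STATUTE":
      print(line.strip())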
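
Note that the loop above prints only the text under -STATUTE-, not the whole block. If the complete -CITE- to -End- block is wanted, as the question asks, one possible variation is to buffer each block and keep it only when a -STATUTE- marker was seen. This is a rough sketch rather than a tested solution for the real files; the names statute_blocks and write_statute_blocks and both file paths are made up for illustration.

```python
import re

# Matches marker lines such as -CITE-, -STATUTE-, -MISC1- or -End-.
MARKER = re.compile(r"^-([A-Za-z0-9]+)-")

def statute_blocks(lines):
    """Yield each -CITE- ... -End- block that contains a -STATUTE- marker."""
    block = []           # lines of the block currently being collected
    has_statute = False  # set once -STATUTE- is seen inside the block
    for line in lines:
        m = MARKER.match(line)
        tag = m.group(1).upper() if m else None
        if tag == "CITE":
            # A new block starts; drop anything left unfinished.
            block, has_statute = [line], False
        elif block:
            block.append(line)
            if tag == "STATUTE":
                has_statute = True
            elif tag == "END":
                if has_statute:
                    yield "".join(block)
                block = []

def write_statute_blocks(src_path, dst_path):
    """Copy only the blocks with a -STATUTE- tag from src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for block in statute_blocks(src):
            dst.write(block)
            dst.write("\n")  # blank line between blocks
```

Calling write_statute_blocks("usc13.txt", "statutes.txt") (both names are placeholders) would then read the input line by line, so only one block is held in memory at a time.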

I'd read the text line by line and parse it myself. This way you can handle large input as a stream. There are nicer-looking solutions using multiline regexps, but those need the whole input in memory and so cannot process it as a stream.

#!/usr/bin/env python

import sys, re

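# states for our state machine:
OUTSIDE = 0               # between blocks (before -CITE- or after -End-)
INSIDE = 1                # inside a block, no -STATUTE- seen yet
INSIDE_AFTER_STATUTE = 2  # inside a block that contains -STATUTE-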

def eachCite(stream):
  state = OUTSIDE
  for lineNumber, line in enumerate(stream):
    if state in (INSIDE, INSIDE_AFTER_STATUTE):
      capture += line
    if re.match('^-CITE-', line):
      if state == OUTSIDE:
        state = INSIDE
        capture = line
      elif state in (INSIDE, INSIDE_AFTER_STATUTE):
        raise Exception("-CITE- in -CITE-??", lineNumber)
      else:
        raise NotImplementedError(state)
    elif re.match('^-End-', line):
      if state == OUTSIDE:
        raise Exception("-End- without -CITE-??", lineNumber)
      elif state == INSIDE:
        yield False, capture
        state = OUTSIDE
      elif state == INSIDE_AFTER_STATUTE:
        yield True, capture
        state = OUTSIDE
      else:
        raise NotImplementedError(state)
    elif re.match('^-STATUTE-', line):
      if state == OUTSIDE:
        raise Exception("-STATUTE- without -CITE-??", lineNumber)
      elif state == INSIDE:
        state = INSIDE_AFTER_STATUTE
      elif state == INSIDE_AFTER_STATUTE:
        raise Exception("-STATUTE- after -STATUTE-??", lineNumber)
      else:
        raise NotImplementedError(state)
  if state != OUTSIDE:
    raise Exception("EOF in -CITE-??")

for withStatute, cite in eachCite(sys.stdin):
  if withStatute:
    print("found cite with statute:")
    print(cite)

If you want to read from a file instead of sys.stdin and write the matching blocks to another file, you can do it like this:

with open('myInputFileName') as myInputFile, \
     open('myOutputFileName', 'w') as myOutputFile:
  for withStatute, cite in eachCite(myInputFile):
    if withStatute:
      myOutputFile.write("found cite with statute:\n")
      myOutputFile.write(cite)
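
For comparison, the multiline-regexp alternative mentioned above could look roughly like the following. It is only a sketch and assumes the whole file fits in memory, which is exactly the drawback of this approach; the function name statute_blocks_regex is made up for illustration.

```python
import re

# One block: from a line starting with -CITE- up to and including the
# first following line that starts with -End-.
BLOCK = re.compile(r"^-CITE-.*?^-End-[^\n]*\n?", re.DOTALL | re.MULTILINE)
STATUTE = re.compile(r"^-STATUTE-", re.MULTILINE)

def statute_blocks_regex(text):
    """Return all -CITE- ... -End- blocks in text that contain -STATUTE-."""
    return [m.group(0) for m in BLOCK.finditer(text)
            if STATUTE.search(m.group(0))]
```

This is shorter than the state machine, but it performs none of the consistency checks (for example a -CITE- inside a -CITE-), and the text has to be read in one go, e.g. with myInputFile.read().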
