Bash or Python for extracting blocks from text files

I have a huge text file, which is structured as:

SEPARATOR
STRING1
(arbitrary number of lines)
SEPARATOR
...
SEPARATOR
STRING2
(arbitrary number of lines)
SEPARATOR
SEPARATOR
STRING3
(arbitrary number of lines)
SEPARATOR
....

The only things that change between the different "blocks" of the file are the STRING and the content between the separators. I need a script in bash or Python which, given a STRING_i as input, writes an output file containing

SEPARATOR
STRING_i
(number of lines for this string)
SEPARATOR

What is the best approach here: bash or Python? Or is there another option? It must also be fast.

Thanks

In Python 2.6 or better:

def doit(inf, ouf, thestring, separator='SEPARATOR\n'):
  thestring += '\n'
  for line in inf:
    # here we're always at the start-of-block separator
    assert line == separator
    blockid = next(inf)
    if blockid == thestring:
      # found block of interest, use enumerate to count its lines
      for c, line in enumerate(inf):
        if line == separator: break
      assert line == separator
      # emit results and terminate function
      # writelines adds no newlines itself, so the count line needs its own '\n'
      ouf.writelines((separator, thestring, '(%d)\n' % c, separator))
      inf.close()
      ouf.close()
      return
    # non-interesting block, just skip it
    for line in inf:
      if line == separator: break

In older Python versions you can do almost the same, but change the line blockid = next(inf) to blockid = inf.next().

The assumptions here are that the input and output files are opened by the caller (which also passes in thestring, and optionally separator), but that it is this function's job to close them once the block has been found (e.g. for maximum ease of use as a pipeline filter, with inf being sys.stdin and ouf being sys.stdout). That is easy to tweak if needed, of course.

Removing the asserts will speed it up microscopically, but I like their "sanity checking" role (and they may also help in understanding the logic of the code flow).

Key to this approach is that a file is an iterator (of lines) and iterators can be advanced in multiple places (so we can have multiple for statements, or specific "advance the iterator" calls such as next(inf), and they cooperate properly).
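A minimal sketch of that point, using an in-memory list of lines in place of a real file. The function name copy_block and the sample data are made up for illustration, and it returns the block's lines rather than a count. The outer for loop, the next() call and the inner for loop all pull from the same iterator, so each one continues where the previous one stopped.

```python
def copy_block(lines, wanted, separator="SEPARATOR\n"):
    """Return the lines of the block whose id line is `wanted`, separators included.

    `lines` is any iterable of newline-terminated lines (e.g. an open file).
    Returns an empty list if no such block exists.
    """
    it = iter(lines)               # one shared iterator for every loop below
    wanted += "\n"
    for line in it:                # outer loop: look for an opening separator
        if line != separator:
            continue               # tolerate filler lines between blocks
        blockid = next(it, None)   # advance the same iterator by one line
        if blockid is None:
            break                  # file ended right after a separator
        block = [separator, blockid]
        for line in it:            # inner loop: resumes right after the id line
            block.append(line)
            if line == separator:
                break              # closing separator reached
        if blockid == wanted:
            return block
    return []


sample = [
    "SEPARATOR\n", "A\n", "a1\n", "SEPARATOR\n",
    "SEPARATOR\n", "B\n", "b1\n", "b2\n", "SEPARATOR\n",
]
print("".join(copy_block(sample, "B")))
```

With a real file, passing the open file object as lines should behave the same way, since a file is itself an iterator of lines.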

I would use Python and write something similar to this:

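import sys

SEPARATOR = "SEPARATOR"  # assumed literal text of the separator line
target = sys.argv[1]     # the STRING_i to look for

counter = 0
counting = False
with open("file", "r") as f:
  for line in f:
    line = line.rstrip("\n")  # compare without the trailing newline
    if counting:
      if line == SEPARATOR:
        break  # closing separator: stop without counting it
      counter += 1
    elif line == target:
      counting = True
print("%s\n%s\n%d\n%s" % (SEPARATOR, target, counter, SEPARATOR))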

If you want this to be fast, you need to avoid reading the entire file to find the block of data you need.

  1. Read over the file once and store an index of a) the byte offset of the start of each STRING_i and b) the length of its block in bytes, i.e. the distance to the next SEPARATOR. You can store this index in a separate file or in a 'header' of the current file.
  2. For each STRING_i query, read in the index, then:
if STRING_I in index:
     file.seek( start_byte_location )
     file.read( length )
     return parse_with_any_of_procedures_above # like @gruszczy's doit() but w/o loop

Don't go overboard with the index: use a dict of STRING_i -> (location, length), and just simplejson / pickle it out to a file.
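The two steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names (build_index, save_index, load_index, read_block) are made up, the separator is taken to be the literal line SEPARATOR, block ids are assumed to be unique, and the standard-library json module stands in for simplejson. Offsets are counted in bytes, so the file is opened in binary mode.

```python
import json


def build_index(path, separator=b"SEPARATOR\n"):
    """Scan the file once; map each block id to (offset, length) in bytes."""
    index = {}
    with open(path, "rb") as f:
        offset = 0         # byte offset of the line about to be processed
        start = None       # offset of the current block's opening separator
        blockid = None
        expect_id = False  # True when the next line is a block id
        for line in f:
            if start is None:
                if line == separator:
                    start, expect_id = offset, True
            elif expect_id:
                blockid = line.rstrip(b"\r\n").decode("utf-8")
                expect_id = False
            elif line == separator:
                # closing separator: block runs from start to the end of this line
                index[blockid] = (start, offset + len(line) - start)
                start = None
            offset += len(line)
    return index


def save_index(index, index_path):
    with open(index_path, "w") as f:
        json.dump(index, f)


def load_index(index_path):
    # JSON turns the (offset, length) tuples into two-element lists
    with open(index_path) as f:
        return json.load(f)


def read_block(path, index, wanted):
    """Return the raw bytes of one block, or None if the id is not indexed."""
    if wanted not in index:
        return None
    start, length = index[wanted]
    with open(path, "rb") as f:
        f.seek(start)          # jump straight to the block
        return f.read(length)  # read only that block
```

The index only pays off when the same file is queried more than once: building it still costs one full pass, but every later lookup is a single seek and read.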

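You can use gawk, which is a relatively fast tool for processing files. Note that the RT variable used below is a gawk extension, and that the pattern is matched as a regular expression against the whole block, not only its first line.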

read -p "Enter input: " input
gawk -v input="$input" -v RS="SEPARATOR" '$0 ~ input { printf "%s%s%s\n", RT, $0, RT }' file
