
Read in a file and skip the header portion of a text file in Python

I've taken a book in text format from gutenberg.org and I'm trying to read in the text, but skip the beginning portion of the file, and then use a process function I've written to parse the rest. How can I do this?

Here is the start of the text file.

The Project Gutenberg EBook of The Kama Sutra of Vatsyayana, by Vatsyayana

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: The Kama Sutra of Vatsyayana
       Translated From The Sanscrit In Seven Parts With Preface,
       Introduction and Concluding Remarks

Author: Vatsyayana

Translator: Richard Burton
            Bhagavanlal Indrajit
            Shivaram Parashuram Bhide

Release Date: January 18, 2009 [EBook #27827]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***




Produced by Bruce Albrecht, Carla Foust, Jon Noring and
the Online Distributed Proofreading Team at
http://www.pgdp.net

And here is my code, which currently processes the whole file:

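import string

def process_file(filename):
    """Open a file and return a dict mapping each word to its count."""
    h = dict()
    # The with block closes the file even if processing raises.
    with open(filename) as fin:
        for line in fin:
            process_line(line, h)
    return h

def process_line(line, h):
    """Update the counts in h with the words found in line."""
    # Treat hyphens as word separators.
    line = line.replace('-', ' ')

    for word in line.split():
        # Strip surrounding punctuation and normalize case.
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()

        h[word] = h.get(word, 0) + 1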

Add this:

for line in fin:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

just before your own "for line in fin:" loop. Because fin is an iterator, your loop then picks up at the line after the marker.
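The two-loop trick relies on the file object being a single iterator that remembers its position. A minimal sketch of that behaviour, using io.StringIO as a hypothetical stand-in for the opened Gutenberg file (the sample text is made up for illustration):

```python
import io

# Hypothetical stand-in for the real file; StringIO iterates
# line by line just like an open text file.
sample = io.StringIO(
    "Title: Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "first body line\n"
    "second body line\n"
)

# First loop: consume lines up to and including the marker line.
for line in sample:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

# Second loop: the same iterator resumes right after the marker.
body = [line.rstrip("\n") for line in sample]
print(body)
```

If the marker never appears, the first loop consumes the whole file and the second loop sees nothing, so a misspelled marker silently yields an empty result.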

You can simply read lines until one matches your criterion, which skips the beginning:

def process_file(filename):
    """Open a file and return a dict mapping each word to its count."""
    h = dict()
    with open(filename) as fin:
        # Skip everything up to and including the start marker.
        for line in fin:
            if line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***":
                break

        # The iterator resumes on the line after the marker.
        for line in fin:
            process_line(line, h)

    return h

Note that I used line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***" as the criterion in this example, but you can just as well define your own.
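As one possible "criterion of your own", the sketch below matches only a prefix of the marker line, so it does not depend on the book title. The function name count_words_after_marker and the prefix "*** START OF" are assumptions made for illustration, not part of the original code, and unlike the question's process_line this version skips tokens that are empty after stripping punctuation.

```python
import io
import string
from itertools import dropwhile

# Assumed prefix of the marker line; adjust it to match your files.
START_PREFIX = "*** START OF"

def count_words_after_marker(lines, start_prefix=START_PREFIX):
    """Count words in the lines that follow the first marker line."""
    counts = {}
    # dropwhile discards lines until one starts with the prefix.
    body = dropwhile(lambda line: not line.startswith(start_prefix), lines)
    # Discard the marker line itself; does nothing if no marker was found.
    next(body, None)
    for line in body:
        for word in line.replace("-", " ").split():
            word = word.strip(string.punctuation + string.whitespace).lower()
            if word:  # skip tokens that were pure punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts

# Made-up sample text standing in for an opened file.
sample = io.StringIO(
    "header words here\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK X ***\n"
    "Hello, hello world-wide\n"
)
counts = count_words_after_marker(sample)
print(counts)
```

With a real file you would pass the open file object instead, for example inside a "with open(filename) as fin:" block.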
