Python Pandas read csv file with variable preamble length

Question

Hi, I'm using pandas to read in a series of files and concatenate them into a DataFrame. Each file starts with a block of garbage lines, of variable length, that I want to ignore. pd.read_csv() has the skiprows parameter, so I've written a function that finds the header row for it, but that means opening each file twice. Is there a better way?

import pandas as pd

HEADER = '#Start'

def header_index(file_name):
    with open(file_name) as fp:
        for ind, line in enumerate(fp):
            if line.startswith(HEADER):
                return ind

for row in directories:
    path2file = '%s%s%s' % (path2data, row, suffix)
    myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')

Any help would be greatly appreciated.

Answer 1

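This is now possible without opening the file twice (I don't know whether it was possible back then): read lines from the open file until the header line is found, seek back to the start of that line, and hand the file object to read_csv: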

with open(path2file) as fp:
    pos = 0
    oldpos = None

    while pos != oldpos:  # stop once the position no longer advances (EOF)
        line = fp.readline()
        if line.startswith(HEADER):
            # set the read position to the start of the line
            # so pandas can read the header
            fp.seek(pos)
            break
        oldpos = pos
        pos = fp.tell()  # remember this position as the start of the next line

    myDF = pd.read_csv(fp, ...your options here...)
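The same loop can be wrapped in a small helper. The following is only a sketch: the name seek_to_header and its True/False return value are assumptions for illustration, not part of the original answer, and an in-memory io.StringIO stands in for a real file so the snippet runs on its own.

```python
import io

HEADER = "#Start"  # marker from the question


def seek_to_header(fp, marker=HEADER):
    """Position fp at the start of the first line beginning with marker.

    Returns True if the marker was found, False if EOF was reached first.
    (Hypothetical helper, named here for illustration only.)
    """
    while True:
        pos = fp.tell()       # start of the line about to be read
        line = fp.readline()
        if not line:          # readline() returns '' at EOF
            return False
        if line.startswith(marker):
            fp.seek(pos)      # rewind so the header line is read again
            return True


# Stand-in for a file opened with open(path); the sample content is made up.
sample = io.StringIO("junk line\nmore junk\n#Start\tvalue\n1\t2\n")
found = seek_to_header(sample)
remaining = sample.read()     # everything from the header line onward
```

With a real file, the positioned object would be passed on as pd.read_csv(fp, ...) instead of calling read(). Returning a flag lets the caller skip files that contain no header line at all.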

Answer 2

Since read_csv() also accepts a file-like object, you can skip the leading junk lines yourself and then pass that object instead of the file name.

Example:

Replace

df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)

with:

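def forward_csv(f, prefix):
    # Advance f to the start of the first line that begins with prefix
    pos = 0
    while True:
        line = f.readline()
        # Stop at EOF (empty string) or at the header line
        if not line or line.startswith(prefix):
            f.seek(pos)  # rewind to the start of that line
            return f
        pos += len(line.encode('utf-8'))  # byte length of the line just skipped

# newline='' keeps '\r\n' endings intact so the byte count stays accurate
with open(filename, newline='') as f:
    df = pd.read_csv(forward_csv(f, HEADER), ...)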

Notes:

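readline() returns the empty string when EOF is reached.
Tracking the position by hand instead of calling tell() saves some lseek system calls.
The last line of forward_csv() assumes that your input file is encoded in ASCII or UTF-8; if it isn't, you have to adjust this line. It also assumes readline() returns line endings unchanged: by default, text mode translates '\r\n' to '\n', so for files with Windows line endings the byte count comes up short unless the file is opened with newline=''.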
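One way to sidestep the encoding and line-ending caveats is to do the skipping in binary mode, where len(line) is the exact byte length, and only decode afterwards. This is a sketch under assumptions: forward_binary is a hypothetical name, the sample bytes are made up, and it presumes an encoding in which a newline is the single byte 0x0A (ASCII, UTF-8, Latin-1 and similar, but not UTF-16).

```python
import io


def forward_binary(f, prefix):
    """Advance binary file f to the first line starting with prefix (bytes).

    Hypothetical variant of forward_csv(); on EOF, f is left at the end.
    """
    pos = f.tell()
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)       # rewind to the start of that line
            return f
        pos += len(line)      # exact byte count, line ending included


# Made-up sample: Latin-1 text with Windows line endings.
raw = io.BytesIO("caf\xe9 junk\r\nmore junk\r\n#Start\tvalue\r\n1\t2\r\n".encode("latin-1"))

# Decode only after positioning; newline='' leaves line endings untouched.
text = io.TextIOWrapper(forward_binary(raw, b"#Start"), encoding="latin-1", newline="")
first = text.readline()
second = text.readline()
```

With a real file, raw would come from open(filename, 'rb'), and the wrapped text object would be passed to pd.read_csv() in place of the file name.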

2 answers

solution1
0 2019-07-24 15:29:08

solution2
0 2020-11-01 17:52:20
