Hi, I'm using pandas to read in a series of files and concatenate them into a DataFrame. My files have a variable number of garbage lines at the beginning that I want to ignore. pd.read_csv() has the skiprows parameter, and I've written a function to compute its value, but I have to open each file twice to make it work. Is there a better way?

    HEADER = '#Start'

    def header_index(file_name):
        with open(file_name) as fp:
            for ind, line in enumerate(fp):
                if line.startswith(HEADER):
                    return ind

    for row in directories:
        path2file = '%s%s%s' % (path2data, row, suffix)
        myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')

Any help would be greatly appreciated.
This would now be possible (don't know if it was possible back then) as follows:

    with open(path2file) as fp:
        pos = 0
        oldpos = None
        while pos != oldpos:  # stop at EOF, where the position no longer advances
            line = fp.readline()
            if line.startswith(HEADER):
                # set the read position back to the start of this line
                # so pandas can read the header
                fp.seek(pos)
                break
            oldpos = pos
            pos = fp.tell()  # remember this position as the start of the next line
        df = pd.read_csv(fp, ...your options here...)

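The snippet above can be wrapped in a small helper. The sketch below is one way to do that, not the only one; the name `seek_to_header` and the sample data are made up for illustration. It uses `readline()` rather than `for line in fp` as in the question, because in text mode Python disables `tell()` while a file is being iterated.

```python
import io

HEADER = '#Start'

def seek_to_header(fp, marker=HEADER):
    """Position fp at the start of the first line beginning with marker.

    Hypothetical helper; if no such line exists, fp is left at EOF.
    """
    while True:
        pos = fp.tell()          # start of the line about to be read
        line = fp.readline()
        if not line:             # empty string means EOF
            return fp
        if line.startswith(marker):
            fp.seek(pos)         # rewind so the header line is read again
            return fp

# Made-up sample: two junk lines, then a tab-separated table.
sample = io.StringIO("junk\nmore junk\n#Start\tvalue\n1\t2\n")
seek_to_header(sample)
remaining = sample.read()        # everything from the header line onward
```

With a real file, the object returned by `seek_to_header(open(path2file))` could be passed to `pd.read_csv` in place of the file name.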
Since read_csv() also accepts a file-like object, you can skip the leading junk lines before passing that object, instead of passing the file name.
Example: replace

    df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)

with:

    def forward_csv(f, prefix):
        pos = 0
        while True:
            line = f.readline()
            if not line or line.startswith(prefix):
                f.seek(pos)
                return f
            pos += len(line.encode('utf-8'))

    df = pd.read_csv(forward_csv(open(filename), HEADER), ...)

Notes:
- readline() returns the empty string when EOF is reached.
- Not calling tell() to keep track of the position saves some lseek system calls.
- forward_csv() assumes that your input file is encoded in ASCII or UTF-8. If it isn't, you have to adjust the line that updates pos (the encode('utf-8') call).
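
If the encoding is unknown, or the files use Windows line endings (which text mode translates to `\n`, so the encoded length no longer matches the bytes on disk), one option is to read in binary mode instead. The following is a minimal sketch under that assumption; `forward_csv_bytes` is a hypothetical name and the sample data is made up. In binary mode `len(line)` is the exact number of bytes read, so neither `tell()` nor an encoding guess is needed.

```python
import io

def forward_csv_bytes(f, prefix):
    """Advance binary stream f to the first line starting with prefix (bytes).

    Hypothetical variant of forward_csv(); leaves f at EOF if no line matches.
    """
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)          # back to the start of the matching line
            return f
        pos += len(line)         # exact byte count, whatever the encoding

# Made-up data: Latin-1 junk with Windows line endings before the header.
# io.BytesIO stands in for open(filename, 'rb').
raw = 'caf\xe9 junk\r\nmore junk\r\n#Start\tvalue\r\n1\t2\r\n'.encode('latin-1')
f = forward_csv_bytes(io.BytesIO(raw), b'#Start')
first_line = f.readline()
```

The returned object should be usable with `pd.read_csv` in place of the file name; if the file is not UTF-8, the `encoding` argument of `read_csv` would need to be set accordingly.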