
If the next line of a file contains a string, append it to the end of the current one

I have a CSV with 13 million lines. The fields are not quoted and some of them contain newlines, so a single row of data can be split across two lines. A row is never split more than once.

How would I take data like this?

Line of data
Line of data
 continuation of previous line of data
Line of data
Line of data
 continuation of previous line
Line of data

And turn it into this:

Line of data
Line of data continuation of previous line of data
Line of data
Line of data continuation of previous line
Line of data

I've tried storing each line in a variable while processing the next one, checking whether the next line's first character is anything but 'L', and appending it if so. I've also tried using f.tell() and f.seek() to move around in the file, but I haven't been able to get either approach to work.

Assuming every time a line starts with a space it should be concatenated with the preceding line, this should work:

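with open(data) as infile:  # data is the path to the input file
    previous_line = None
    for line in infile:
        line = line.rstrip('\n')
        if previous_line is not None and line.startswith(' '):
            # Continuation: its leading space doubles as the separator.
            previous_line += line
        else:
            # A new row starts, so the buffered row is complete.
            if previous_line is not None:
                print(previous_line)
            previous_line = line
    if previous_line is not None:  # print the final row
        print(previous_line)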

Here's a cheap, reasonably efficient continuation line joiner for you.

def cont_lines(source):
    last_line = ''
    for line in source:
        if line.startswith(' '):
            last_line = last_line.rstrip('\n') + line  # append a continuation; its leading space is the separator
        else:
            if last_line:
                yield last_line
            last_line = line
    if last_line:  # The one remaining as the source has ended.
        yield last_line

Use like this:

with open("tile.csv") as f:
    for line in cont_lines(f):
        pass  # do something with line

It only uses as much memory as the longest set of continuation lines in your file.
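As a rough sketch of how that streaming property could be used on a 13-million-line file, the following writes the joined rows straight to a second file object instead of collecting them. The name `join_continuations` is hypothetical, and it assumes, as above, that a continuation line always starts with a space. The demo runs on in-memory text built from the question's sample data.

```python
import io

def join_continuations(src, dst):
    """Copy lines from src to dst, gluing each space-prefixed line onto the row before it.

    Returns the number of rows written.
    """
    pending = None  # the row currently being assembled, without its newline
    written = 0
    for line in src:
        line = line.rstrip('\n')
        if pending is not None and line.startswith(' '):
            pending += line  # the leading space doubles as the separator
        else:
            if pending is not None:
                dst.write(pending + '\n')
                written += 1
            pending = line
    if pending is not None:  # flush the final row
        dst.write(pending + '\n')
        written += 1
    return written

# Small in-memory demo using the sample data from the question.
sample = io.StringIO(
    "Line of data\n"
    "Line of data\n"
    " continuation of previous line of data\n"
    "Line of data\n"
)
result = io.StringIO()
count = join_continuations(sample, result)
```

With real files, `src` and `dst` would be two handles opened in the same `with` statement, one for reading and one for writing. Only one row is held in memory at a time.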

I was able to work out something.

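infile = "test.txt"

def peek_line(f):
    # Read the next line without consuming it.
    pos = f.tell()
    line = f.readline()
    f.seek(pos)
    return line

with open(infile, 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        peek = peek_line(f)
        # A line that does not start with 'T' is treated as a continuation.
        # The peek check skips the join at end of file, where peek is ''.
        if peek and not peek.startswith('T'):
            line = line.rstrip('\n') + f.readline()
        print(line, end='')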

I'm open to feedback on this method.
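One possible variation, offered as a sketch rather than a drop-in replacement: keeping a one-line lookahead in a variable avoids the tell()/seek() round trip, which reads every line twice. The names `join_by_record_start` and `is_record_start` are hypothetical. The predicate stands in for the startswith('T') test, and the sample rows are made up for illustration.

```python
def join_by_record_start(lines, is_record_start):
    """Yield rows, appending any line that fails is_record_start to the row before it."""
    it = iter(lines)
    current = next(it, None)
    while current is not None:
        nxt = next(it, None)  # one-line lookahead; None at end of input
        while nxt is not None and not is_record_start(nxt):
            # nxt is a continuation: drop the newline of the current row and glue it on.
            current = current.rstrip('\n') + nxt
            nxt = next(it, None)
        yield current
        current = nxt

# Made-up sample rows; every real record here starts with 'T'.
rows = list(join_by_record_start(
    ["Test row 1\n", "Test row 2\n", " rest of row 2\n", "Test row 3\n"],
    lambda line: line.startswith('T'),
))
```

Because it only calls next() on an iterator, this works on any iterable of lines, including an open file, and it also copes with a row that is split more than once.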
